The Journal of
Educational Psychology


Devoted Primarily to the Scientific Study of Problems of 
Learning and Teaching. 











BOARD OF EDITORS: 


HAROLD ORDWAY RUGG, Chairman, Lincoln School of Teachers College, Columbia University.
CARLETON [remainder of name illegible], Training School for Teachers.
FRANK NUGENT FREEMAN, [affiliation illegible].
ARTHUR IRVING GATES, Teachers College, Columbia University.
ALLEN CHARLES HENMON, University of Wisconsin.
RUDOLF PINTNER, Teachers College, Columbia University.
BEARDSLEY RUML, Carnegie Corporation, New York City.
LEWIS MADISON TERMAN, Leland Stanford Junior University.
EDWARD LEE THORNDIKE, Teachers College, Columbia University.
LAURA ZIRBES, Assistant Editor, Lincoln School of Teachers College.








OCTOBER, 1923








CONTENTS 


A Study of the Downey Test by the Method of Estimates. Norman C. Meier

The Joint Yield from Teams of Tests. Clark L. Hull

The Advantages of the Probable Error of Measurement as a Criterion of the Reliability of a Test or Scale. P. H. Nygaard

The Validation of Intelligence Tests (Continued).

Predicting Academic Success. Mark A. May

Comparison of IQ's Obtained with Dearborn Group Tests and the Stanford Revision. Frank S. Freeman ... 441

New Publications in Educational Psychology and Related Fields of Education





Published Monthly Except June to August by 
WARWICK and YORK, Inc., 
York, Pa. Baltimore, Md.











Entered as second-class matter November 15, 1921, at the Post Office at York, Pennsylvania, under the Act of March 3, 1879.




















The Journal of 
Educational Psychology 


is published monthly, by Warwick & York,
Inc., Baltimore, Md., and York, Pa. The issues
for 1922 make Volume XIII. Title page and 
index are bound in December number. 

Manuscripts for publication, books or other 
materials for review, and news items should be 
addressed to Harold Ordway Rugg, Chairman of 
the Editorial Board, 59 Edgecliffe Terrace, 
Yonkers, New York. 









The price of the Journal is $4.00 a year, payable in advance. Foreign
postage is 30 cents extra. Single issues are 60 cents each, and less than
a full year's subscription is charged at single issue rate for every
month ordered.

Subscribers should notify the publishers of change in address at least
four weeks before publication of the issue with which change is to take
effect. No claim for non-receipt of an issue will be entertained unless
made within two weeks after receipt of the next succeeding issue.

Warwick & York, Inc.































THE JOURNAL OF 
EDUCATIONAL PSYCHOLOGY 


Volume XIV October, 1923 Number 7 














A STUDY OF THE DOWNEY TEST BY THE 
METHOD OF ESTIMATES 


NORMAN C. MEIER 
State University of Iowa 


The growing interest which has been evinced in the Downey Will-Profile¹
(Will-Temperament Test) since its publication in 1919 is perhaps
attributable to the current interest in tests which may serve to
supplement the measures of general intelligence, and to the fact that it
is the most systematic attempt yet made at the appraisal of temperamental
qualities. This field of volitional and temperamental measurement
presents complex and difficult problems, so that it seems somewhat
unlikely that any test will ever sample the variables which constitute
temperament with a high degree of certainty. The study reported herein²
is an attempt to ascertain the degree of reliability which this
promising test possesses.

A consideration of the previous studies of the test³ gives a variety
of conclusions concerning the value of the test, and reveals certain 
limitations and shortcomings in procedure, unforeseen at the time, 
from which this study has profited. Space limitations forbid a critical 
review, but the points of departure will be found in the context. 


EXPERIMENTAL PROCEDURE 


Two procedures may be followed to secure data upon the reliability 
of the test. The first consists of giving other objective tests of known 
validity that would provide measures of traits similar to or identical 


1 Downey, June E.: “The Will-Profile, A Tentative Scale for the Measurement
of the Volitional Pattern.” Psychology Bulletin No. 3, University of Wyoming.
2 At the University of Chicago, under the direction of H. A. Carr and F. N.
Freeman.
3 See Bibliography, under Bryant, Clark, Downey, and Ruch.










with those in the Downey scale. The degree of agreement of the two 
sets of tests would be the basis for drawing the inference of validity. 
Certain difficulties, however, render this procedure impracticable: 
chiefly, that the time requirements for each person to undergo all the 
necessary tests would be prohibitive, and that a sufficient number of 
such tests are not yet available. 

The second procedure, followed in this study, consists in first 
administering the test to a comparatively large number of subjects, 
and then securing for comparison purposes three independent estimates 
from persons who know the subjects well. This method admits of the 
possibility of uncertain conclusion: that is, in the absence of clues to 
the contrary, it may be impossible to ascertain whether greater validity 
is possessed by the test-scores or the estimates. 

Because of its obvious significance for vocational and educational 
guidance and as its application would be peculiarly appropriate to 
adolescents, it was decided to use high school students for subjects. 
Through the cooperation of the Superintendent of the Laboratory 
Schools and the Principal, the test was given in individual form to 
106 students in the University High School. Of these, 64 took the test 
in group form later. This high school draws its attendance from the 
city of Chicago and environs; about 10 per cent come from University 
faculty homes, the remainder from diverse situations in life. Of the 
number tested, 52 were boys; 54 girls. The median age was 16, the 
distribution ranging from 13 to 20. The median mental age (94
cases) was 17 years, 3 months, and the median IQ 112.0.¹ In scholastic
position they ranged as follows: 


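[label illegible] ................  1
[label illegible] ................ 27    25.5 per cent
[label illegible] ................ 37    35 per cent
Sophomores ....................... 19    18 per cent
[label illegible] ................ [count and percentage illegible]
[label illegible] ................  5    4.7 per cent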


The names of subjects were drawn, with few exceptions, at random 
from the school files by the office assistant. The subjects were in 
nearly all cases quite naive as regards this test—a condition which rules 
out the disturbing elements of anticipation, of the tendency to analyze 


1 The Terman Group Intelligence Test was given by Mr. Wm. C. Reavis,
Principal of the U. H. S. The IQ was obtained from the point score by tables.







purposes as one proceeds, and of over-zealousness to do the right thing 
while missing the real import of given directions. 

Other conditions were maintained with the view toward having all 
circumstances favorable for best results. Distractions were reduced to 
a minimum. The administering of the individual form was done after
the experimenter had given the test to 24 other subjects, so that 
he was neither unfamiliar with the test nor inexpert in giving it. The 
subject was kept in ignorance of the purpose of the test. Time was 
kept by a laboratory stop-watch, reading to one-fifth second. 

The names of persons who were to supply estimates of the traits 
were secured from the subject. These raters were (a) the teacher who 
knew the subject best; (b) a parent; and (c) some friend (not a teacher) 
over 18, who had known the subject for at least two years. In this 
third class were included older friends of the family, scout masters, 
Sunday-school teachers, relatives not in the same household, and 
older chums. 


METHOD OF SECURING RATINGS 


To find a method for securing reliable ratings proved to be a difficult
problem. A search of the literature failed to reveal any extended
descriptions of the traits, not excepting the writings of Dr. Downey
herself. The theoretical discussion in the original bulletin¹ and the
journal article² lack a complete and definite statement of just what the
traits mean. A further objection to their use may be raised in that they
are couched in psychological terms, the meaning of which is not agreed
upon even among persons in the same psychological laboratory. Discussions
with persons not psychologically trained indicated a much greater
obscurity in the meaning of these terms, particularly “motor impulsion,”
“freedom from load,” and “volitional perseveration”; hence the obvious
necessity of first putting these expressions into non-technical language.

It is essential that Rater X have the same or approximately the 
same idea of each trait as Rater Y if comparable data are to be 
obtained; hence an attempt was made to formulate as complete and 
understandable a statement of the trait’s meaning as possible. The 
material consulted comprised the articles referred to, the revised 





1 The Will-Profile, op. cit.
2 Downey, June E.: Some Volitional Patterns Revealed by the Will-Profile.
Journal Experimental Psychology, Vol. III (1920), pp. 281-301.







Manual,¹ lecture notes,² and the treatment of some of the traits in texts.³
When drawn up, the statements were presented to competent critics for
examination and suggestions.⁴ When the form had undergone four
revisions it was sent to Dr. Downey who, after minor criticisms,
approved it.⁵

In constructing a satisfactory rating device to embody the three
conditions laid down by Rugg,⁶ it was decided after some experimentation
to adopt a modification of the man-to-man comparison type of
scale. Instead of requiring the rater to construct twelve master-scales
(one for each trait), he was asked to recall all the friends and
acquaintances he could who are fast (or slow), if the matter involves
speed of reaction, in this particular trait, and to place the person
rated, appropriately, in comparison with them. The judgments were
to be registered on a graphic rating scale, which would encourage
greater freedom in use of the entire range rather than the middle
points predominantly, and which provides finer numerical units.
The preliminary instructions were designed to free the rater from all 
presuppositions regarding the nature of the ratings desired; further- 


1 Manual of Directions, Downey Will-Temperament Test. Yonkers, N. Y., 
World Book Co., 1921. 

2 Course, by Professor Downey, University of Chicago, Summer, 1920. 

3 E.g.: Motor impulsion and inhibition, in James, Psychology, Chap. 21, N. Y.,
Holt, 1890. Other traits are treated in Ribot, Psychology of the Emotions, Chap.
7, N. Y., Scribners, 1898.

4 The writer wishes to acknowledge his great indebtedness to Professors Carr and 
Freeman, whose searching criticisms provided stimulus for five revisions. Valuable 
suggestions were supplied by Professors Kingsbury and Robinson and by Mr. 
Kornhauser. 

5 A description of the form is impossible here because of space limitations but 
the following description of the first trait may be taken as a sample. 

1. Speed of Movement. 

Usual rate of movement (relative to his size and age). 

How does he move about naturally, normally, habitually? Consider the 
rate after he gets started, and when not impelled by any particular urge. 

Does he walk quickly, or slowly? How does he talk in usual, normal
speech? Write? How fast, or how slow, are his movements in shuffling and
dealing cards; in sealing and stamping an envelope? Not how fast he can
walk, talk, write, seal an envelope, if under stress, but how fast or how
slow does he move when at his usual duties?

slow                    average                    fast

6 Is the Rating of Human Character Practicable? Journal of Educational
Psychology, Vol. XII (1921), pp. 425-38.







more, to bait his interest, and to convey some idea of what would be 
expected in a normal distribution of degrees of traits. 
The forms were sent by mail with stamped envelope for return. 


THE DATA 


Of 106 forms sent out to each group, 95 per cent were returned
from the teachers, 86 per cent from the parents, and 73 per cent from
the friends. For 64 per cent of the subjects, returns were received from
raters in all three classes; after four of these were discarded because of
incomplete scores, 64 cases, or 60 per cent, remained perfect and
complete. These were used for most of the computations which follow.
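
The return figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes each reported percentage was rounded from a whole number of forms out of 106, and the helper name `implied_count` is hypothetical.

```python
# Whole-number counts implied by the reported percentages (106 forms per group).
FORMS_SENT = 106


def implied_count(per_cent: float, total: int = FORMS_SENT) -> int:
    """Nearest whole number of forms corresponding to a reported percentage."""
    return round(total * per_cent / 100)


teachers = implied_count(95)
parents = implied_count(86)
friends = implied_count(73)

# Subjects for whom all three classes of raters returned a form.
common = implied_count(64)

# Four of those were discarded for incomplete scores.
complete = common - 4
complete_per_cent = 100 * complete / FORMS_SENT

print(teachers, parents, friends, common, complete, round(complete_per_cent))
```

Under that assumption the 64 complete cases correspond to the 60 per cent stated in the text.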


I. Correlation of test scores with the estimates of the three sets of judges, pooled. Trait by trait.

Trait                 1     2     3     4     5     6     7     8     9    10    11    12   Average
P.E.               ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08   ±.08
Pooled estimates    [?]   [?]   [?]   [?]   [?]   [?]   .14   .14   .24   .21   .07   .03   .1183

[?] = entry illegible in the source.





II. Correlation of test scores with the estimates of the three sets of judges, separately. Trait by trait.

Trait                  1     2     3     4     5     6     7     8     9    10    11    12   Average
P.E.                ±.08  ±.08  ±.07  ±.06  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08   ±.08
Teachers' estimates  .24   .17   .36   .11   .02   .04   .03   .01   .16   .18   .12   .01   .0075
P.E.                ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08   ±.08
Parents' estimates   .11   .01   .07   .16   .06   .13   .05   .06   .09   .22   .05   .10   .0542
P.E.                ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08   ±.08
Friends' estimates   .13   .06   .30   .10   .20   .07   .02   .01   .04   .03   .17   .09   .0067

[The averages imply that some coefficients are negative; the signs of individual entries are not legible in the source.]









































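The Spearman Rank-Difference Formula was used for these computations.
Use was made of the Scott Company’s “Tables to Facilitate the Computation of
Coefficients of Correlation by the Rank-Difference Method,” which follows the
formula

$$\rho = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$

where D is the difference between the two ranks assigned to a subject and n is the number of subjects.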
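
As an illustration of the rank-difference formula, the following minimal sketch computes the coefficient directly from two lists of scores. It is not the Scott Company table procedure; the function names, the handling of ties by mean rank, and the five sample scores are assumptions made for the example.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_difference_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-difference coefficient: 1 - 6*sum(D^2) / (n*(n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series of at least two scores")
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical test scores and pooled estimates for five subjects.
test_scores = [3, 7, 5, 9, 1]
estimates = [2, 8, 6, 7, 1]
print(round(rank_difference_rho(test_scores, estimates), 2))
```

With these made-up scores only two subjects differ in rank, each by one place, so the sum of squared differences is 2 and the coefficient is 1 - 12/120.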










III. Correlation of estimates with estimates of the several judges. Trait by trait.

Trait                  1     2     3     4     5     6     7     8     9    10    11    12   Average
P.E.                ±.08  ±.08  ±.08  ±.08  ±.07  ±.08  ±.08  ±.08  ±.08  ±.08  ±.07  ±.08   ±.08
Teachers, parents    .23   .07   .10   .21   .34   .03  -.01   .03   .24   .03   .33   .17   .1425
P.E.                ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08  ±.08   ±.08
Teachers, friends    .30   .06  -.19   .01   .37   .13   .04  -.12   .02   .07   .28  -.02   .0792
P.E.                ±.06  ±.08  ±.08  ±.08  ±.05  ±.08  ±.08  ±.08  ±.07  ±.08  ±.08  ±.08   ±.08
Parents, friends     .56   .06   .18   .43   [?]   .15   .00   .24   .33   .19   .30   .30   .2850

[?] = entry illegible in the source.









































(Rank-Difference Method) 
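
The last column of Table III appears to be the simple mean of the twelve trait coefficients; the teachers-friends row, whose entries are all legible, reproduces its printed average. On that assumption, and on the further assumption that no minus signs were lost from the parents-friends row, the one illegible entry in that row can be backed out from the printed average. The sketch below is an inference for the reader's convenience, not a value taken from the source.

```python
# Teachers-friends row: every coefficient is legible, so it serves as a check
# that the final column is the plain mean of the twelve trait coefficients.
teachers_friends = [.30, .06, -.19, .01, .37, .13, .04, -.12, .02, .07, .28, -.02]
check_mean = sum(teachers_friends) / len(teachers_friends)

# Parents-friends row: eleven legible coefficients (trait 5 is illegible).
parents_friends_legible = [.56, .06, .18, .43, .15, .00, .24, .33, .19, .30, .30]
printed_average = .2850
n_traits = 12

# If the printed average is the mean of all twelve, the missing entry follows.
implied_missing = n_traits * printed_average - sum(parents_friends_legible)

print(round(check_mean, 4), round(implied_missing, 2))
```

The implied value would make trait 5 the highest parents-friends coefficient, which is at least consistent with the smaller probable error printed for that column.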


IV. Correlation of the individual form with the group form. Trait by trait.

Trait                 1     2     3     4     5     6     7     8     9    10    11    12   Average
P.E.               ±.05  ±.06  ±.08  ±.08   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   ±.08
Individual, group   .57   [?]   [?]   [?]   [?]   [?]   [?]   [?]   .55   [?]   .04   .20   .2230

[?] = entry illegible in the source.











(Product-Moment Formula) 


V. Correlation of Downey test total scores with point scores of Terman
group intelligence test.

ρ = +.21 (Rank-Difference Method)

VI. Correlation of total scores from individual forms with total scores from
group form.

ρ = +.60 (Rank-Difference Method)


The measure of agreement between the pooled estimates and the
test scores is, in the view of Rugg,¹ of most significance. These
correlations appear to be consistently low or negligible. Correlation 
is present but low in certain traits, namely speed of movement, flexi- 
bility, speed of decision, motor inhibition, and interest in detail. 
In the other traits it is indifferent or negligible. 

From these gross results it may be concluded that disagreement 
exists, but nothing more. To convict the test without inquiring into 


the character of the witness would be a misuse of statistical method. 


1 “Is the Rating of Human Character Practicable?” op. cit.







ANALYSIS OF THE DATA 


The explanation of the disagreement may lie in any one of three 
directions. There are these possibilities: 

(a) That the test is inadequate or defective as a measure of these 
traits; 

(b) That the estimates are unreliable; 

(c) That both test scores and estimates are expressions of some- 
thing, but of different things. 

Considering these in turn, it may be said that (a) is prima facie a 
reasonable and natural interpretation. Certain facts, however, cannot 
be passed by without comment. The first is that, if this hypothesis 
is the correct one, some evidence would be afforded by repeating the 
test with the same subjects. This was not done, but the giving of the 
group form two to four months later constitutes a condition approxi- 
mating repetition, since a number of the traits are tested in similar 
ways in both forms. High correlation appears in five traits (those 
objectively secured and scored) while in the other traits negligible 
or inverse correlations are found. This indicates some degree of 
confirmation in the instance of the five traits, but from another view- 
point it casts serious doubt on the worth of the group form. The 
second point is that the nature of the problem is such that usual limits 
for high and low correlation could not, in perfect fairness, be applied. 
The error-possibilities are exceptionally manifest. To enumerate a
few, there are first the errors of interpretation: varying interpretations
of the printed descriptions of the traits; errors due to a common bias
on the part of all observers; those due to observers taking different
points of view; and random errors, including guesses. The errors of
recording also operate to limit the correlation possibilities; these arise
because of individual differences in the process of registering the
judgment, once made. Since the rater is using for his yardstick a
group of acquaintances, his judgment will be the same as
another judge’s only in the instance that the two have identical groups
of friends, which rarely if ever is the case. The circle of one judge’s
friends may embrace fewer than one hundred persons; another’s may
include ten times that number. Persons differ, furthermore, in the
accuracy of their observations and in the keenness of their estimations
of ability. This condition offers no excuse for lack of consistency in
the test, but rather suggests the inherent difficulty of this type of
research, and raises the question whether such estimations of human
qualities are practicable at all.


PLATE I.

PLATE II.







(b) That the Estimates Are Unreliable-—With the expectation that 
the plotting of distribution curves would throw some light upon the 
disparity between scores and estimates, curves were made for each 
trait, for each of the three series of estimates. (See Plates I and II.) 
Normally the test scores would be expected to follow a rectangular 
distribution since that distribution was adopted by Dr. Downey in 
constructing the norms. This condition was found to obtain in 
traits 1, 8, 9, 11 and roughly in 2. Trait 6 was concentrated about
the score 4; 9 about 5; 10 and 11 about 8 and 9 respectively. 

An examination of the distribution of the estimates reveals other 
significant features. Of the 36 curves representing distributions of 
estimates, only 12 show any resemblance to a normal probability 
curve: 1-t, 1-p, 3-f, 5-p, 5-f, 6-p, 7-t, 8-t, 9-t, 10-t, 10-f, 11-t, and
12-f. Twenty-one of the total exhibit an undue and disproportionate 
concentration about the score of 5. Seventeen show a similar concen- 
tration about the scores of 9 and 10. Four instances occur of a 
concentration about the scores of 7 and 8, or the three-quarters point. 

The “bunching” of scores about the half-way point in so many
instances (and at the end and three-quarters position in other
instances) strongly indicates that in more than half the cases the raters
were unable to come to any certain conclusion and hence had recourse to
the conservative “average” rating; and in the cases of higher ratings,
were likewise unable to judge accurately and rated high because the
bias worked toward giving a favorable return.

With such observations in mind it becomes extremely probable
that the unreliability of the ratings is very considerable, and that
therefore the fact of low correlation may be traced to this as one of the
sources. Just how much this unreliability has to do with making the
correlations as low as they are it is impossible to determine. The low
correlations among raters add weight to the supposition of
unreliability; furthermore, the one instance of high correlation (parents
with friends) is explainable by the fact that in numbers of cases the
returns are practically identical, indicating that the judgments were
made in conference (this is known to be the case in several instances).
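
The argument that unreliable ratings depress the correlation with test scores can be illustrated with a small simulation. Everything in the sketch is hypothetical: the trait, the test scores, and the estimates are simulated normal variables, and the two noise levels are arbitrary choices; none of it uses the study's data.

```python
import random
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Product-moment correlation of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulated_agreement(rater_noise: float, n: int = 2000, seed: int = 0) -> float:
    """Correlation of test scores with estimates when both reflect one trait.

    rater_noise is the standard deviation of the raters' random error,
    expressed in units of the trait's own standard deviation.
    """
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n)]
    test = [t + rng.gauss(0, 0.5) for t in trait]           # fairly precise test
    estimate = [t + rng.gauss(0, rater_noise) for t in trait]
    return pearson(test, estimate)


careful_raters = simulated_agreement(rater_noise=0.5)
guessing_raters = simulated_agreement(rater_noise=3.0)
print(round(careful_raters, 2), round(guessing_raters, 2))
```

In this toy model the test is identical in both runs, yet the agreement falls sharply when the raters' error grows, which is the sense in which low correlations alone cannot convict the test.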

(c) That Both Test-scores and Estimates are Expressions of Something 
but the Two Values Are Different Things.—It is conceivable that the 
test for speed of decision does not actually measure speed but rather 
something allied yet different, as possibly care in decision, or honesty 
in decision; while the rater was judging with quite a different criterion, 
still he may have emphasized ease in coming to decision or some- 
















thing else. In any such instance no considerable correlation could 
be expected. In this possibility there is no practicable way of checking 
back to investigate the actual condition. 


CONCLUSION 


In summarizing these results it may be said that: 

1. Estimates of traits judged from fairly explicit descriptions of 
them show indifferent or negligible correlation with scores earned in the 
Downey Test, individual form. 

2. This absence of high positive correlation is not to be taken as 
absolute evidence against the validity of the test, for the reason that 
there are indications that the estimates were in many cases unreliable; 
but rather as grounds for questioning the value of the test for certain 
of the traits. 

3. From the point of view of practical utility, these results may be 
taken to indicate that, in its present state of development, the test is 
considerably imperfect because the traits it purports to measure are 
not such that people can readily understand and identify. 


BIBLIOGRAPHY 


1. Bryant, E. K.: The Will-Profile of Delinquent Boys. Journal of
Delinquency, Vol. VI, 1921, pp. 294-309.

2. Clark, Willis W.: Supervised Conduct-Responses of Delinquent Boys.
Journal of Delinquency, Vol. VI, 1921, pp. 287-301.

3. Downey, June E.: Graphology and the Psychology of Handwriting.
Educational Psychology Monograph 34. Baltimore, Warwick and York, 1919.

4. Downey, June E.: The Adolescent Will-Profile. Journal Educational
Psychology, Vol. XI, 1920, pp. 157-64.

5. Downey, June E.: Some Volitional Patterns Revealed by the Will-Profile.
Journal Experimental Psychology, Vol. III, 1920, pp. 281-301.

6. Downey, June E.: Ratings for Intelligence and for Will-Temperament.
School and Society, Vol. IX, 1920.

7. Ruch, G. M.: A Preliminary Study of Correlations between Estimates of
Volitional Traits and Results from the Downey Will-Profile. Journal Applied
Psychology, Vol. V, 1921, pp. 159-63.

8. Rugg, H. O.: Is the Rating of Human Character Practicable? Journal
Educational Psychology, Vol. XII, 1921, pp. 425-501; Vol. XIII, 1922, pp. 81-93.



















THE JOINT YIELD FROM TEAMS OF TESTS 
CLARK L. HULL 


University of Wisconsin, Madison, Wisconsin 


In the early days of mental testing, even where several tests were 
combined for a single determination, little attention was paid to the 
correlations among the tests themselves, although in a few cases there 
seems actually to have been a conscious attempt to have the tests 
correlate as highly among themselves as possible. At the present 
time, however, it is generally recognized by psychologists that for the 
tests of a team or battery to yield the maximum prediction, there must 
be not only as high a correlation as possible between each test and the 
criterion, but at the same time as low a correlation as possible among 
the individual tests. 

This fundamental contrast accordingly divides the correlation 
coefficients involved in a team of tests, into two distinct groups. The 
division in question is shown systematically in Table I. Here 
appear the correlation coefficients necessary for a regression equation 
involving five tests. In the subscripts, I signifies the criterion, the
Arabic numerals designating the various tests. The coefficients
appearing in the first column should accordingly be as high as possible
while those in the remaining columns should be as low as possible. If,
for purposes of analysis, we assume all the coefficients in the
first column to be equal, we may represent them without distinction by r′;
and if for similar reasons we assume all of the remaining coefficients to
be equal, we may represent them uniformly by r″.





TABLE I

To be as high as possible (r′)     To be as low as possible (r″)
r_I1
r_I2                               r_12
r_I3                               r_13  r_23
r_I4                               r_14  r_24  r_34
r_I5                               r_15  r_25  r_35  r_45
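
The two groups of coefficients in Table I can be enumerated for a team of any size. The short sketch below is illustrative only; the function name and the string labels are assumptions, with `I` standing for the criterion as in the text.

```python
from itertools import combinations


def coefficient_groups(n_tests: int) -> tuple[list[str], list[str]]:
    """Labels of the criterion correlations and of the test intercorrelations."""
    tests = range(1, n_tests + 1)
    with_criterion = [f"r_I{t}" for t in tests]                   # to be high (r')
    among_tests = [f"r_{a}{b}" for a, b in combinations(tests, 2)]  # to be low (r'')
    return with_criterion, among_tests


high, low = coefficient_groups(5)
print(len(high), high)
print(len(low), low)
```

For n tests there are n coefficients of the first kind and n(n - 1)/2 of the second, so five tests require five and ten respectively, as in Table I.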





Now if the special r values assumed above be substituted
appropriately in the formulas for partial correlation and the results
simplified in each case, the following expressions are obtained:





$$r_{I2.1} = \frac{r' - r'r''}{\sqrt{1 - r'^2}\,\sqrt{1 - r''^2}}$$

$$r_{23.1} = \frac{r''}{1 + r''}$$

$$r_{I3.12} = \sqrt{\frac{r'^2(1 - r'')}{(1 + r'' - 2r'^2)(1 + 2r'')}}$$

$$r_{34.12} = \frac{r''}{1 + 2r''}$$

$$r_{I4.123} = \sqrt{\frac{r'^2(1 - 2r'' + r''^2)}{(1 + r'' - 3r'^2 - 2r''^2 + 3r'^2 r'')(1 + 3r'')}}$$
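
The simplified expressions above can be checked numerically against the ordinary recursion for partial correlation. The sketch below assumes, as the text does, that every criterion correlation equals r′ and every intercorrelation equals r″; the function names are hypothetical, and the sample values .40 and .20 are those used later in the article.

```python
from math import isclose, sqrt


def partial(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y with one further variable z held constant."""
    return (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))


def equal_case_partials(r1: float, r2: float, order: int) -> tuple[float, float]:
    """Partial coefficients of a given order under the equality assumption.

    Returns (criterion-with-test partial, test-with-test partial) after
    `order` other tests have been held constant.
    """
    crit, inter = r1, r2
    for _ in range(order):
        # Both new values are computed from the previous order's values.
        crit, inter = partial(crit, crit, inter), partial(inter, inter, inter)
    return crit, inter


r1, r2 = 0.40, 0.20

crit1, inter1 = equal_case_partials(r1, r2, 1)
crit2, inter2 = equal_case_partials(r1, r2, 2)
crit3, _ = equal_case_partials(r1, r2, 3)

# Closed forms as given in the text.
closed_crit1 = (r1 - r1 * r2) / (sqrt(1 - r1 ** 2) * sqrt(1 - r2 ** 2))
closed_inter1 = r2 / (1 + r2)
closed_crit2 = sqrt(r1 ** 2 * (1 - r2) / ((1 + r2 - 2 * r1 ** 2) * (1 + 2 * r2)))
closed_inter2 = r2 / (1 + 2 * r2)
closed_crit3 = sqrt(
    r1 ** 2 * (1 - 2 * r2 + r2 ** 2)
    / ((1 + r2 - 3 * r1 ** 2 - 2 * r2 ** 2 + 3 * r1 ** 2 * r2) * (1 + 3 * r2))
)

print(isclose(crit1, closed_crit1), isclose(inter1, closed_inter1))
print(isclose(crit2, closed_crit2), isclose(inter2, closed_inter2))
print(isclose(crit3, closed_crit3))
```

Because all tests are interchangeable under the equality assumption, the value of each partial coefficient depends only on its order, not on which particular tests appear in its subscripts.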


These values being substituted appropriately in the general formula for
multiple correlation,

$$R_{I(12 \ldots n)} = \sqrt{1 - (1 - r_{I1}^2)(1 - r_{I2.1}^2) \cdots (1 - r_{In.12 \ldots (n-1)}^2)}$$

we obtain by successive application, the following:

For two tests, $R = r'\sqrt{\dfrac{2}{1 + r''}}$

For three tests, $R = r'\sqrt{\dfrac{3}{1 + 2r''}}$

For four tests, $R = r'\sqrt{\dfrac{4}{1 + 3r''}}$

The succeeding members of the series may now be written by analogy
thus:

For five tests, $R = r'\sqrt{\dfrac{5}{1 + 4r''}}$

For six tests, $R = r'\sqrt{\dfrac{6}{1 + 5r''}}$


Or, more generally: 




















For n tests,

$$R = r'\sqrt{\frac{n}{1 + (n - 1)r''}} \tag{1}$$

From (1) is obtained directly:

$$n = \frac{R^2(1 - r'')}{r'^2 - r''R^2} \tag{2}$$

$$r' = R\sqrt{\frac{1 + r''(n - 1)}{n}} \tag{3}$$

$$r'' = \frac{nr'^2 - R^2}{R^2(n - 1)} \tag{4}$$










With the aid of equations (1) to (4), we may now proceed to the
quantitative analysis of the relation of the four basic factors involved
in predictions from teams of tests. The first matter to be considered is
the amount contributed to the yield (R) of a team of tests by the
addition of each constituent test. To be specific: if we assume that r′ =
.40 and r″ = .20, how much does the addition of a second test to the
first of a team increase the yield over that given by the first alone?
How much does a third test add to the yield of the first two? How
much does the fifth or the seventh test add? How much will 50 or 100
tests yield over five or six? By substituting successively in equation
(1) we obtain the values in Table II, which gives the yield of varying
numbers of tests of the potency assumed above.

Perhaps the most striking fact revealed by Table II is the marked 
tendency to diminishing returns met with as successive tests are added. 
This tendency is shown even more clearly by Table III. 


TABLE II
Showing the yield of varying numbers of tests where r′ = .40 and r″ = .20.

    2 tests, R = .516         50 tests, R = .8596
    3 tests, R = .584         51 tests, R = .8616
    4 tests, R = .632        100 tests, R = .8768
    5 tests, R = .667        101 tests, R = .8772
    6 tests, R = .693       1000 tests, R = .8926

TABLE III
Showing the amount added to the yield (R) by each succeeding test of the series
considered in Table II.

    No. 1, .40          No. 5, .035
    No. 2, .116         No. 6, .026
    No. 3, .068         No. 51, .002
    No. 4, .048         No. 101, .0004
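
Tables II and III can be regenerated from equation (1). The sketch below assumes the values r′ = .40 and r″ = .20 stated in the text; the function names are hypothetical. The computed figures agree with the printed ones to within about .002, though a few printed entries (for example those for 3 and 50 tests) differ from the computed values in the third decimal place, whether from the original arithmetic or from the reproduction cannot be told here.

```python
from math import sqrt

R_PRIME = 0.40         # correlation of each test with the criterion
R_DOUBLE_PRIME = 0.20  # correlation of each test with every other test


def joint_yield(n: int, r1: float = R_PRIME, r2: float = R_DOUBLE_PRIME) -> float:
    """Equation (1): yield R of a team of n tests."""
    return r1 * sqrt(n / (1 + (n - 1) * r2))


def added_by(n: int) -> float:
    """Amount the n-th test adds to the yield of the first n - 1 tests."""
    if n == 1:
        return joint_yield(1)
    return joint_yield(n) - joint_yield(n - 1)


# Values as printed in Table II, kept only for comparison.
PRINTED_TABLE_II = {
    2: .516, 3: .584, 4: .632, 5: .667, 6: .693,
    50: .8596, 51: .8616, 100: .8768, 101: .8772, 1000: .8926,
}

for n, printed in PRINTED_TABLE_II.items():
    print(f"{n:>4} tests: computed R = {joint_yield(n):.4f}, printed {printed}")

for n in (1, 2, 3, 4, 5, 6):
    print(f"No. {n}: adds {added_by(n):.3f}")

# As n grows without limit, R approaches r' / sqrt(r'').
ceiling = R_PRIME / sqrt(R_DOUBLE_PRIME)
print(f"limiting yield: {ceiling:.4f}")
```

The limiting value shows why even 1000 tests of this potency cannot carry the yield much beyond .89.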


A somewhat more comprehensive view of the variability in yield of
teams of tests as dependent upon the values of n and r″ is shown in
Fig. 1. The yield is shown in multiples of r′ in order to make the curves
comparable and as general as possible. Here it is seen that at the
extreme where r″ = 1.00, there can be no increase in R from the
addition of new tests. As r″ grows less, the possibility of increasing R
by the addition of tests becomes increasingly greater. In all cases,
however, except where r″ = −.10, the curves show a tendency to
diminishing returns with the increase in the number of tests.
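
The behavior described for Fig. 1 follows from equation (1) once the yield is divided by r′. The sketch below computes that multiple for a few values of r″; the function names are hypothetical, and the chosen values of r″ are illustrative rather than read from the figure.

```python
from math import sqrt


def yield_multiple(n: int, r2: float) -> float:
    """Yield expressed as a multiple of r': sqrt(n / (1 + (n - 1) * r''))."""
    denominator = 1 + (n - 1) * r2
    if denominator <= 0:
        raise ValueError("this combination of n and r'' is not possible")
    return sqrt(n / denominator)


def gains(r2: float, up_to: int) -> list[float]:
    """Increase in the multiple contributed by tests 2, 3, ..., up_to."""
    return [yield_multiple(n, r2) - yield_multiple(n - 1, r2)
            for n in range(2, up_to + 1)]


for r2 in (1.00, 0.50, 0.20, 0.00, -0.10):
    row = ", ".join(f"{yield_multiple(n, r2):.2f}" for n in range(1, 7))
    print(f"r'' = {r2:+.2f}: {row}")
```

With r″ = 1.00 the multiple stays at 1 however many tests are added; with positive r″ below 1 each added test contributes less than the one before; with r″ = −.10 the later gains grow again, which is the exception the text notes.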














[Figure: horizontal axis, the number of tests; vertical axis, yield in terms of r′.]
Fig. 1. The yield from teams of tests having varying degrees of independence among the individual tests of the team.

[Figure: horizontal axis, the number of tests.]
Fig. 2. The yield of teams of tests with various r′ and r″ values. The first of each pair of numbers found on the curves stands for r′; the second for r″.


The relatively simple situation revealed by Fig. 1 undergoes
interesting complications when r' is also varied and the yield is shown in
terms of R. A few typical curves of this second kind are shown in
Fig. 2. The two pairs of curves originating respectively at .30 and
.40 may be compared. In the upper curve of each pair r'' = .00; in
the lower, r'' = .20. The frequent crossing of the various curves
shows that there are many cases of approximately equivalent yields
from different r' and r'' combinations if n be chosen suitably. For
example, there is no choice between (r' = .40, r'' = .00) and (r' = .50,
r'' = .65) where two tests are used. With a smaller number of tests
the combination (r' = .50, r'' = .65) gives the greater yield but with
more tests the combination (r' = .40, r'' = .00) is decidedly better.
With five tests the combination (r' = .30, r'' = .00) yields the same as
(r' = .40, r'' = .20), both being better than (r' = .50, r'' = .65).
With more tests, (r' = .30, r'' = .00) is decidedly better than either.
The remarkable upward sweep of the curve (r' = .05, r'' = -.10) is
also to be noted. Speaking roughly, high values of both r' and r'' are
superior to low values of both where only a small number of tests are
used, but the latter tend to be superior where more tests are available.

Now the r' values of single aptitude tests for genuine vocations do
not ordinarily rise much above .40 and they are usually lower. If the
tests are chosen wisely, r'' may be kept down around .20 or so. After
considering the above examples of the diminishing returns in the yield
of teams of tests with the increase in the number of tests employed,
particularly as shown in Table III, we find ourselves questioning
whether the increase in yield from the addition of a sixth or a seventh
test of the grade just mentioned, will be worth the extra time and labor
necessary for the giving and scoring of it. It is quite evident that on
the principle of diminishing returns a point is certain to be reached
sooner or later where the addition of a new test will not pay. It is
clearly a matter of some importance to the future of vocational
psychology, whether this point is reached before or after a really satisfactory
R has been obtained. It is evident that the point at which this critical
change takes place, is dependent upon a number of factors. As a
necessary preliminary in the analysis, it is desirable to determine the
number of tests of the grade assumed, which will be necessary to
produce an R of satisfactory proportions. Assuming R = .75 as
about the lower limit for purposes of satisfactory vocational guidance,
say, the number of tests required may readily be found by substituting
appropriately in equation (2):


\[ n = \frac{.75^{2}\,(1 - .20)}{.40^{2} - .20 \times .75^{2}} = 9.47 \]





It must be confessed that the prospect of having to use nine or ten
tests is not particularly reassuring. It is true, of course, that certain
batteries of tests now in use include a number of test units approaching
this, notably Army Alpha which has eight. Such a large number is
only possible where the methods of administering and scoring the
tests are very economical. Under extremely favorable circumstances
in this respect it is likely that, if available, a number of tests distinctly
greater than ten might be profitable. But the availability of such a
large number of relatively high grade tests which will at the same time
be economical in administration, is a much more serious question.
As a general thing, the tests which have been economical in
administration so far, have been group tests. But as now conducted, such
tests seem fated to have an extremely high correlation among
themselves owing to the large element of pencil-and-paper behavior which
they have in common. Thus the r'' values of Army Alpha range around
.60; and those of the National Intelligence Tests run even higher.
Fortunately the r' values of these tests also run much higher than the .40
assumed above. But even so, the proportions of curve (r' = .50,
r'' = .65) in Fig. 2, show that the hope of increasing the size of R
very much by the multiplication of such tests is not well founded when
the number exceeds three or four. Indeed, it is probable that one or
more of the less efficient units of each of the well known batteries of
tests just considered, might be discarded without appreciably
diminishing the predictive yield of the respective teams as a whole.

An interesting question suggested by Table I, is whether the
combination (r' = .40, r'' = .20) would ever produce a perfect R (1.00)
even with an unlimited number of such tests available. To answer
this question it will be necessary to carry our mathematical analysis a
little further. An inspection of the right-hand member of equation
(2) above, shows that r''R² may never become larger than r'², else n
would become negative, which is impossible. That is to say:

\[ r'' R^{2} \not> r'^{2} \qquad (5) \]

If we further assume R to be at its maximum (1.00) then (5) becomes:

\[ r'' \not> r'^{2} \qquad (6) \]

That is, if R is to reach perfection, then r'' can never exceed r'². We
are accordingly able to state definitely in answer to the above question, 










that no matter how many tests of the strength (r' = .40, r'' = .20) may
be combined, a perfect prediction can never be obtained, since .20 is
greater than .40², which is only .16.

In cases such as that just considered, where a perfect prediction
cannot be obtained, it becomes a matter of some interest to know what
the maximum possible yield may be. By referring to equation (2)
once more, it is apparent upon inspection that n will become infinite if

\[ r'' R^{2} = r'^{2} \qquad (7) \]

Accordingly,

\[ R_{max} = \sqrt{\frac{r'^{2}}{r''}} \qquad (8) \]

But since by inspection, R is at its maximum when n = ∞, equation
(8) will give the limiting yield of any combination of r' and r'' values
in which r'' > r'². Thus in the case of the combination (r' = .40,
r'' = .20) already considered,

\[ R_{max} = \sqrt{\frac{.40^{2}}{.20}} = .8944 \]


As would be expected, this value differs only very slightly from that
obtained when n = 1,000 as shown in Table II. As a second example
of the application of (6) and (8) we may consider Army Alpha where
roughly r' = .58 and r'' = .60.

\[ R_{max} = \sqrt{\frac{.58^{2}}{.60}} = .749 \]

It is interesting to note that this figure is not much above what was
obtained by the army psychologists under favorable conditions. Of
perhaps more significance is the indication that with an indefinitely
larger number of such tests available, they could not have materially
bettered the yield actually obtained.
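
The ceiling given by equation (8) is easy to tabulate. The sketch below is illustrative only: the function name is hypothetical, and the treatment of the case r'' ≤ r'² (returning 1.00, since a perfect R is then reachable with a finite number of tests) is an assumption drawn from conditions (5) and (6) rather than something stated as a formula in the text.

```python
import math


def limiting_yield(r_crit, r_inter):
    """Largest R obtainable from an unlimited number of tests.

    Uses equation (8), R_max = sqrt(r'^2 / r''), when r'' > r'^2.
    When r'' <= r'^2 the ceiling is taken as 1.00 (an assumption based
    on conditions (5) and (6)).
    """
    if r_inter <= r_crit ** 2:
        return 1.0
    return math.sqrt(r_crit ** 2 / r_inter)


if __name__ == "__main__":
    print(f"r' = .40, r'' = .20: {limiting_yield(0.40, 0.20):.4f}")
    print(f"r' = .58, r'' = .60: {limiting_yield(0.58, 0.60):.3f}")
```

The second call corresponds to the rough Army Alpha figures quoted above.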

A special case of some interest is that in which the tests are all
assumed to be strictly independent of each other, i.e., where r'' =
.00. In this event equations (1), (2) and (3) reduce respectively to:

\[ R = r'\sqrt{n} \qquad (9) \]
\[ n = \frac{R^{2}}{r'^{2}} \qquad (10) \]
\[ r' = \frac{R}{\sqrt{n}} \qquad (11) \]




And if in addition R be assumed as perfect (R = 1.00) then (10) and
(11) become respectively,

\[ n = \frac{1}{r'^{2}} \qquad (12) \]
\[ r' = \frac{1}{\sqrt{n}} \qquad (13) \]


By substituting appropriately in the above equations, we may obtain
the r' values necessary to produce the various values of R from varying
numbers of tests. Certain of these are shown systematically in
Table IV.


TABLE IV

              R = 1.00    R = .75    R = .50
                 r'          r'         r'
   n =   2      .709        .532       .354
   n =   3      .578        .433       .289
   n =   4      .50         .375       .25
   n =   5      .446        .335       .223
   n =  10      .316        .237       .158
   n = 100      .10         .075       .05
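
A table of this kind follows directly from equation (11), r' = R/√n, for strictly independent tests. The sketch below is a hedged illustration with a hypothetical function name; values computed this way may differ from the printed entries in the third decimal place (for example, 1/√2 is about .707).

```python
import math


def required_validity(target_R, n):
    """r' that each of n independent tests (r'' = 0) must have to give
    target_R, from equation (11): r' = R / sqrt(n)."""
    return target_R / math.sqrt(n)


if __name__ == "__main__":
    targets = (1.00, 0.75, 0.50)
    print("   n   " + "  ".join(f"R={t:.2f}" for t in targets))
    for n in (2, 3, 4, 5, 10, 100):
        row = "  ".join(f"{required_validity(t, n):6.3f}" for t in targets)
        print(f"{n:4d}   {row}")
```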














The entries in the column “R = 1.00” are of some theoretical interest. 
These figures show that from the synthetic point of view, the half of a 
perfect correlation is not .50 but .709; the third of a perfect correlation 
is not .333 but .578; the fourth of a perfect correlation is not .25 but 
.50, and so on. In this connection, formula (13) may be compared
with Kelley's "coefficient of alienation" or lack of correlation (k).
This value is given by the equation

\[ k = \sqrt{1 - r^{2}} \]

Passing now from the mathematical to the speculative, let us 
consider the notion of a universal team of tests. Ideally this would be 
a battery of tests which would cover the entire range of the determiners 
of human behavior, yet with no overlapping of one test by another. 
How many really independent tests would it take thus to compass the 
range of human aptitudes? Judging by the great amount of overlap- 
ping of most mental alertness tests, it would not take very many really 
independent tests, if such could be found, to span this particular zone 
of human behavior. We know less about motor—and particularly 




character—traits, but presumably it will require a larger number of 
tests to cover these respective phases of human potentiality, to
mention no others. And while it would be obviously unsafe to hazard a
guess as to how many tests would be required as a minimum for such a 
team to approach universality sufficiently close to be useful, it is 
possible that the number may be smaller than might at first be 
supposed. 

The advantage of such a battery of tests for purposes of
vocational guidance, for example, would be enormous. Perhaps this
may best be shown by pointing out the hopelessness of the system of 
vocational guidance as it is now developing. To be able to predict 
the probable vocational aptitude of a youth in one or two vocations, 
may serve very well the purposes of a prospective employer but by no 
means that of the youth seeking a life vocation. What the youth 
desires to know is what vocation of all the vocations of the world he is 
best endowed to pursue. This can be told only after knowing, in some 
sense, his aptitude in each. Now it may be that a really adequate 
vocational guidance is an impossibility, that psychologists will not be 
able to secure sufficient tests of the necessary diagnostic potency. 
But even assuming an unlimited supply of separate teams of tests of 
the type now generally aimed at—the tests of each properly weighted 
by means of a regression equation for its particular vocation—the 
problem of vocational guidance would still be far from solved because 
of the expense involved in such a system. If each battery should 
require one or two hours for giving, scoring, etc., and the multiplicity 
of vocations should be reduced to 40 or 50 type occupations, it would 
require from 50 to 100 hours of labor to discover the two or three most 
promising occupations for each individual! But if, instead of a
multiplicity of teams of tests, a single approximately universal battery of
tests were available with which, through the perfection of group testing by
means of self-recording duplicatable apparatus and other devices,
groups of 25 or 50 subjects could be tested at once, vocational guidance
might easily be economical enough to become universal. By this
latter method there would be merely a multiplication of regression 
equations, each equation weighting the same team of tests in a differ- 
ent way according to the peculiar mental requirements for success 
in each particular vocation. Thus but one set of tests would need to 
be administered and scored. And lastly there might be an automatic 
computing apparatus built on an extension of the principle of 
the modern statistical machines, which would solve the regression 




equations. With the test data from a given subject punched down on 
an appropriate form, this might be fed into the machine and presently 
emerge with the whole series of predictions recorded in the units of 
some convenient, uniform scale that would permit of instant compari- 
son. This, of course, is all highly Utopian. But without some such 
revolution in method as suggested above, it is difficult to see how scien- 
tific vocational guidance can ever become a practical reality for the 
masses of the people. 

Passing to the immediately practical, there seems little doubt that 
the near future will see a tendency towards semi-universal teams of 
tests—single batteries of tests which will apply to a limited variety 
of occupations of a given type, there being separate regression 
equations based on the same tests, for the various occupations of the 
particular group. And along with this ought also to come a specific
concentration upon the psychology of the independencies among tests, 
with consequent changes in the point of view in devising new tests and 
in the technique of choosing the individual tests to make up teams or 
batteries for purposes of vocational prognosis. 










THE ADVANTAGES OF THE PROBABLE ERROR OF 
MEASUREMENT AS A CRITERION OF THE 
RELIABILITY OF A TEST OR SCALE 


P. H. NYGAARD 
University of Minnesota 


To measure the reliability of an educational test several methods
have been employed. The first was to use the coefficient of
correlation obtained by giving two forms of the test to the same group. A
later improvement was to use the probable error of estimate of score
between the two forms; that is, .6745 σ √(1 − r²). Recently it has
been proposed, especially by Monroe, to employ the probable error of
measurement, .6745 σ √(1 − r). This article will show, first, that the
probable error of measurement possesses the advantage of stability
over the two other criteria, and, secondly, that its value can be
computed by a comparatively easy method.


I. STABILITY OF THE PE OF MEASUREMENT


It is a well known fact that the coefficient of reliability can be 
increased by increasing the range. If only one school grade is used, 
the coefficient obtained will be rather low, but all that is necessary to 
get a higher one is to give the two forms to a composite group con- 
sisting of several grades. Neither, as will be seen, is the probable 
error of estimate free from variation. However, the probable error of 
measurement remains the same for a composite group as for a single 
group. 

To show that such is the case the following assumptions will be 
made: 

1. That two tests, or forms of the same test, A and B, are given to 
N groups of n cases each. 

2. That the standard deviation of both A and B in each group is the
same, namely σ₁.

3. That the difference between the means from one group to the
next for both A and B is constant and the same. Let c equal the ratio
of this difference to σ₁; that is, c is the difference divided by σ₁.

4. That the correlation between A and B is the same for each
separate group; namely r₁. Also whenever r is used, the absolute
value of r will be understood.

These assumptions are not far-fetched, for they represent what
practically always happens, or should happen, when forms A and B
of the same test are given to the school grades for which the test is
intended. With these assumptions from which to start, the aim will
be to investigate what occurs when A and B are given to a composite
group consisting of N of the smaller groups. Here, for the most part,
only results will be given, as the proofs involve quite lengthy algebraic
manipulations.

First, as to the composite standard deviation. Let this be
represented by σ_N. Then

\[ \sigma_N = \sigma_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)}{12}} \qquad (1) \]

Secondly, in regard to the composite coefficient of correlation.
Let this be represented by r_N. Then

\[ r_N = 1 - \frac{12\,(1 - r_1)}{12 + c^{2}(N^{2}-1)} \qquad (2) \]

If r₁ = 1, then r_N = 1 − 0 = 1. If r₁ = 0, then there should be no
difference in the means, so c = 0, which makes r_N = 1 − 12/12 = 0. A
glance at the formula shows that r_N increases with an increase in c
and also with an increase in N. By a slight algebraic change the
formula may be written

\[ r_N = r_1 + \frac{c^{2}(N^{2}-1)(1 - r_1)}{12 + c^{2}(N^{2}-1)}, \]

which shows that r_N exceeds r₁, except when c = 0, N = 1, or r₁ = 1.

Thirdly, the effect on the composite probable error of estimate.
Let this be represented by e_N, and the probable error of estimate of a
single group by e₁. Then

\[ e_N = e_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)(1 - r_1)}{[12 + c^{2}(N^{2}-1)](1 + r_1)}} \qquad (3) \]

The fraction under the radical sign is always positive, and hence e_N
exceeds e₁, except when c = 0, N = 1, or r₁ = 1.

Formulas (1), (2), and (3) may in themselves be of considerable 
service. They may be used to transform composite group results 
into equivalents for a single group, or vice versa. 

Fourthly, as to the probable error of measurement. Let this be
represented by ε_N, and the probable error of measurement of a single
group by ε₁. As the algebraic work in this case is brief, it will be
given in full. Using the results of formulas (1) and (2):

\[ \epsilon_N = .6745\,\sigma_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)}{12}}\;\sqrt{1 - \left[1 - \frac{12\,(1 - r_1)}{12 + c^{2}(N^{2}-1)}\right]}, \]

\[ \epsilon_N = .6745\,\sigma_1 \sqrt{\frac{12 + c^{2}(N^{2}-1)}{12} \cdot \frac{12\,(1 - r_1)}{12 + c^{2}(N^{2}-1)}}, \]

\[ \epsilon_N = .6745\,\sigma_1 \sqrt{1 - r_1}. \]

But ε₁ is also equal to .6745 σ₁ √(1 − r₁).

Therefore, ε_N = ε₁.


Hence the same value will be obtained for the probable error of 
measurement whether a single group or a composite group be used, 
which was found to be true of neither the coefficient of correlation as a 
measure of reliability nor the probable error of estimate. This is the 
basis for the claim made at the beginning, that the probable error of 
measurement is a stable quantity, whereas the other criteria are not. 
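
The stability claim can be checked numerically from formulas (1) and (2). The following is a minimal sketch, with hypothetical function names and arbitrary illustrative inputs (σ₁ = 10, r₁ = .60, c = 1.5, N = 4 are assumptions, not data from the article).

```python
import math


def composite_sd(sigma1, c, N):
    """Formula (1): standard deviation of the composite of N groups."""
    return sigma1 * math.sqrt(1.0 + c * c * (N * N - 1) / 12.0)


def composite_r(r1, c, N):
    """Formula (2): correlation within the composite of N groups."""
    return 1.0 - 12.0 * (1.0 - r1) / (12.0 + c * c * (N * N - 1))


def pe_estimate(sigma, r):
    """Probable error of estimate, .6745 * sigma * sqrt(1 - r^2)."""
    return 0.6745 * sigma * math.sqrt(1.0 - r * r)


def pe_measurement(sigma, r):
    """Probable error of measurement, .6745 * sigma * sqrt(1 - r)."""
    return 0.6745 * sigma * math.sqrt(1.0 - r)


if __name__ == "__main__":
    sigma1, r1, c, N = 10.0, 0.60, 1.5, 4  # illustrative values only
    sN, rN = composite_sd(sigma1, c, N), composite_r(r1, c, N)
    print(f"r: single {r1:.3f}, composite {rN:.3f}")
    print(f"PE of estimate: single {pe_estimate(sigma1, r1):.3f}, "
          f"composite {pe_estimate(sN, rN):.3f}")
    print(f"PE of measurement: single {pe_measurement(sigma1, r1):.3f}, "
          f"composite {pe_measurement(sN, rN):.3f}")
```

Under these assumptions the correlation and the probable error of estimate both change with the composite group, while the probable error of measurement is unchanged.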


II. AN EASY METHOD OF CALCULATING THE PROBABLE ERROR OF
MEASUREMENT


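The usual formula for the probable error of measurement, which
will be called ε, is: ε = .6745 σ √(1 − r). Two forms of the same test
have practically the same standard deviation. This will be assumed
to be true. Arbitrary zero points will be used. The deviations from
these arbitrary zero points will be represented by X and Y. By the
ordinary formula, ε = .6745 σ √(1 − r).

But
\[ r = \frac{\dfrac{\Sigma XY}{n} - \dfrac{(\Sigma X)(\Sigma Y)}{n^{2}}}{\sigma^{2}} \]

Therefore
\[ \epsilon = .6745\,\sigma \sqrt{1 - \frac{\dfrac{\Sigma XY}{n} - \dfrac{(\Sigma X)(\Sigma Y)}{n^{2}}}{\sigma^{2}}}, \]

Or
\[ \epsilon = .6745 \sqrt{\sigma^{2} - \frac{\Sigma XY}{n} + \frac{(\Sigma X)(\Sigma Y)}{n^{2}}}. \]

Substitute for σ² its value, ΣX²/n − (ΣX)²/n².

Then
\[ \epsilon = .6745 \sqrt{\frac{\Sigma X^{2}}{n} - \frac{(\Sigma X)^{2}}{n^{2}} - \frac{\Sigma XY}{n} + \frac{(\Sigma X)(\Sigma Y)}{n^{2}}}. \]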













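Multiplying numerator and denominator of each fraction by 2,

\[ \epsilon = .6745 \sqrt{\frac{2\Sigma X^{2}}{2n} - \frac{2(\Sigma X)^{2}}{2n^{2}} - \frac{2\Sigma XY}{2n} + \frac{2(\Sigma X)(\Sigma Y)}{2n^{2}}}. \]

\[ \frac{\Sigma X^{2}}{n} - \frac{(\Sigma X)^{2}}{n^{2}} = \sigma^{2} = \frac{\Sigma Y^{2}}{n} - \frac{(\Sigma Y)^{2}}{n^{2}}. \]

Substitute the latter for one of the former getting

\[ \epsilon = .6745 \sqrt{\frac{\Sigma X^{2}}{2n} - \frac{(\Sigma X)^{2}}{2n^{2}} + \frac{\Sigma Y^{2}}{2n} - \frac{(\Sigma Y)^{2}}{2n^{2}} - \frac{2\Sigma XY}{2n} + \frac{2(\Sigma X)(\Sigma Y)}{2n^{2}}}, \]

\[ \epsilon = .6745 \sqrt{\frac{\Sigma X^{2} - 2\Sigma XY + \Sigma Y^{2}}{2n} - \frac{(\Sigma X)^{2} - 2(\Sigma X)(\Sigma Y) + (\Sigma Y)^{2}}{2n^{2}}}, \]

Or
\[ \epsilon = .6745 \sqrt{\frac{\Sigma (X - Y)^{2}}{2n} - \frac{(\Sigma X - \Sigma Y)^{2}}{2n^{2}}}. \]

Let X − Y = V.

Then
\[ \epsilon = .6745 \sqrt{\frac{1}{2n}\left[\Sigma V^{2} - \frac{(\Sigma X - \Sigma Y)^{2}}{n}\right]}. \]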








By letting the arbitrary zero points be the zero of each scale, the
X's and Y's are simply the original scores, and no deviations need to be
computed. Nevertheless, the numbers involved will not be large, for
only differences are squared.

To illustrate the simplicity of the method, a short problem will be
worked. Suppose the numbers in the columns below, headed X and
Y, represent scores in forms A and B, respectively, of a test, whose
variability is the same.











     X      Y      V²
    75     71      16
    77     74       9
    82     82       0
    73     64      81
    81     81       0
    79     86      49
    63     76     169
    70     78      64
    85     83       4
    80     80       0
   ---    ---     ---
   765    775     392

\[ \epsilon = .6745 \sqrt{\frac{1}{20}\left[392 - \frac{(765 - 775)^{2}}{10}\right]} = .6745 \sqrt{\frac{1}{20}\,(392 - 10)} = .6745 \sqrt{19.1} = 2.9+ \]
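
The worked problem can be reproduced in a few lines. This is a sketch of the difference method as derived above, using the ten score pairs from the example; the function name is hypothetical.

```python
import math


def pe_measurement_from_scores(x, y):
    """Probable error of measurement by the difference method:

        epsilon = .6745 * sqrt( (1 / 2n) * [ sum(V^2) - (sum X - sum Y)^2 / n ] )

    where V = X - Y. Assumes both forms have practically the same
    standard deviation, as the derivation requires.
    """
    n = len(x)
    sum_v_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    difference_of_sums = sum(x) - sum(y)
    return 0.6745 * math.sqrt((sum_v_squared - difference_of_sums ** 2 / n) / (2 * n))


# Scores on forms A and B from the worked problem.
X = [75, 77, 82, 73, 81, 79, 63, 70, 85, 80]
Y = [71, 74, 82, 64, 81, 86, 76, 78, 83, 80]

if __name__ == "__main__":
    print(f"epsilon = {pe_measurement_from_scores(X, Y):.2f}")
```

Only sums of scores and squared differences are needed, which is why the method avoids computing either a standard deviation or a correlation coefficient.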




To conclude, it may be stated that the probable error of 
measurement is by far the easiest to calculate of the criteria of relia- 
bility mentioned at the outset, and is the only one possessing the 
requisite stability. Therefore, the writer is of the opinion that it 
should meet with universal acceptance among the workers in 
educational statistics. 


SUPPLEMENTARY PROOFS 


The assumptions made here are the same as those previously made 
in the article on page 407. 

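1. Composite Standard Deviation.—Before generalizing, let us take
two particular cases. Suppose N = 3, and d = cσ₁, that is, d is the
difference between the means. Then, as may be easily seen by making a
drawing of the three distributions,

\[ \sigma_3 = \sqrt{\frac{\Sigma(x+d)^{2} + \Sigma(x-d)^{2} + \Sigma x^{2}}{3n}} = \sqrt{\frac{\Sigma(x^{2}+2dx+d^{2}) + \Sigma(x^{2}-2dx+d^{2}) + \Sigma x^{2}}{3n}} = \sqrt{\frac{3\Sigma x^{2} + 2nd^{2}}{3n}} \]

Or
\[ \sigma_3 = \sqrt{\frac{\Sigma x^{2}}{n} + \tfrac{2}{3}d^{2}}. \quad \text{But } \frac{\Sigma x^{2}}{n} = \sigma_1^{2}. \]

Therefore,
\[ \sigma_3 = \sqrt{\sigma_1^{2} + \tfrac{2}{3}d^{2}} \]

Suppose N = 4. Then we shall get

\[ \sigma_4 = \sqrt{\frac{\Sigma\left(x+\frac{d}{2}\right)^{2} + \Sigma\left(x-\frac{d}{2}\right)^{2} + \Sigma\left(x+\frac{3d}{2}\right)^{2} + \Sigma\left(x-\frac{3d}{2}\right)^{2}}{4n}} = \sqrt{\frac{4\Sigma x^{2} + 2n\,\frac{d^{2}}{4} + 2n\,\frac{9d^{2}}{4}}{4n}} \]

Or
\[ \sigma_4 = \sqrt{\sigma_1^{2} + \tfrac{5}{4}d^{2}} \]

No matter what number is used for N, the quantity under
the radical sign will be σ₁² plus some number times d². If N is odd,
that number is evidently as follows:

\[ \frac{2\left(1 + 4 + 9 + \ldots \text{ to } \frac{N-1}{2} \text{ terms}\right)}{N} \]

But
\[ \left(1 + 4 + 9 + \ldots \text{ to } \tfrac{N-1}{2} \text{ terms}\right) = \frac{N(N^{2}-1)}{24} \]

Therefore, when N is odd, the coefficient of d² is

\[ \frac{2N(N^{2}-1)}{24N} = \frac{N^{2}-1}{12}. \]

If N is even, the coefficient of d² is as follows:

\[ \frac{2\left(\frac{1}{4} + \frac{9}{4} + \frac{25}{4} + \ldots \text{ to } \frac{N}{2} \text{ terms}\right)}{N} = \frac{\left(1 + 9 + 25 + \ldots \text{ to } \frac{N}{2} \text{ terms}\right)}{2N} \]

But
\[ \left(1 + 9 + 25 + \ldots \text{ to } \tfrac{N}{2} \text{ terms}\right) = \frac{N(N^{2}-1)}{6} \]

Therefore, when N is even, the number by which d² is multiplied is

\[ \frac{N(N^{2}-1)}{12N} = \frac{N^{2}-1}{12}. \]

In either case, therefore, the coefficient of d² is (N² − 1)/12. Hence
for any value of N,

\[ \sigma_N = \sqrt{\sigma_1^{2} + \frac{(N^{2}-1)\,d^{2}}{12}}. \]

Replace d by its value cσ₁. Then

\[ \sigma_N = \sqrt{\sigma_1^{2} + \frac{c^{2}\sigma_1^{2}(N^{2}-1)}{12}} = \sigma_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)}{12}}. \]

This is formula (1) of the article.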
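
The coefficient (N² − 1)/12 can be confirmed directly for both odd and even N. The sketch below is an illustrative check, not part of the original proof: it places N group means one unit of d apart, centres them, and averages the squared offsets using exact fractions. The function name is hypothetical.

```python
from fractions import Fraction


def offset_coefficient(N):
    """Mean squared distance of N equally spaced group means from their
    common centre, in units of d^2 (spacing d taken as 1)."""
    centre = Fraction(N - 1, 2)
    return sum((Fraction(i) - centre) ** 2 for i in range(N)) / N


if __name__ == "__main__":
    for N in range(1, 9):
        print(N, offset_coefficient(N), Fraction(N * N - 1, 12))
```

For N = 3 and N = 4 this gives 2/3 and 5/4, the two particular cases worked above.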
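2. Composite Correlation.—Again, to make the reasoning clearer,
N will be given definite values. Suppose N = 3. Then the ordinary
formula for the coefficient of correlation gives

\[ r_3 = \frac{\Sigma(x+d)(y+d) + \Sigma(x-d)(y-d) + \Sigma xy}{3n\sigma_3^{2}}. \]

Let us for the time being represent
\[ \sqrt{1 + \frac{c^{2}(N^{2}-1)}{12}} \]
by k. Then by formula (1), σ_N = kσ₁.
Therefore, σ₃² = k²σ₁².

Hence,
\[ r_3 = \frac{3\Sigma xy + 2nd^{2}}{3nk^{2}\sigma_1^{2}} = \frac{\Sigma xy}{nk^{2}\sigma_1^{2}} + \frac{2d^{2}}{3k^{2}\sigma_1^{2}} \]

But
\[ \frac{\Sigma xy}{n\sigma_1^{2}} = r_1. \quad \text{Therefore, } r_3 = \frac{r_1}{k^{2}} + \frac{\tfrac{2}{3}d^{2}}{k^{2}\sigma_1^{2}} \]

Suppose N = 4. Then

\[ r_4 = \frac{\Sigma\left(x+\frac{d}{2}\right)\left(y+\frac{d}{2}\right) + \Sigma\left(x-\frac{d}{2}\right)\left(y-\frac{d}{2}\right) + \Sigma\left(x+\frac{3d}{2}\right)\left(y+\frac{3d}{2}\right) + \Sigma\left(x-\frac{3d}{2}\right)\left(y-\frac{3d}{2}\right)}{4n\sigma_4^{2}} \]

But σ₄² = k²σ₁², so that

\[ r_4 = \frac{4\Sigma xy + 2n\,\frac{d^{2}}{4} + 2n\,\frac{9d^{2}}{4}}{4nk^{2}\sigma_1^{2}} \]

Or
\[ r_4 = \frac{r_1}{k^{2}} + \frac{\tfrac{5}{4}d^{2}}{k^{2}\sigma_1^{2}} \]

The numerical coefficients of the last term in each case may be
obtained as before and have the same value, (N² − 1)/12. Hence, in
general,

\[ r_N = \frac{r_1}{k^{2}} + \frac{d^{2}(N^{2}-1)}{12k^{2}\sigma_1^{2}} = \frac{1}{k^{2}}\left[r_1 + \frac{c^{2}(N^{2}-1)}{12}\right] \]

But
\[ k^{2} = 1 + \frac{c^{2}(N^{2}-1)}{12} = \frac{12 + c^{2}(N^{2}-1)}{12}. \]

Therefore,
\[ r_N = \frac{12r_1 + c^{2}(N^{2}-1)}{12 + c^{2}(N^{2}-1)}. \]

An equivalent form obtained by a slight algebraic change is

\[ r_N = 1 - \frac{12\,(1 - r_1)}{12 + c^{2}(N^{2}-1)}. \]

This is formula (2) of the article.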





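3. Composite PE of Estimate.—This is obtained by using the
relation, e_N = .6745 σ_N √(1 − r_N²). In this substitute for σ_N and r_N the
values given by (1) and (2).

Then
\[ e_N = .6745\,\sigma_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)}{12}}\;\sqrt{1 - \left[1 - \frac{12\,(1 - r_1)}{12 + c^{2}(N^{2}-1)}\right]^{2}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{\frac{12 + c^{2}(N^{2}-1)}{12}}\;\sqrt{\frac{144 + 24c^{2}(N^{2}-1) - 144r_1^{2} - 24c^{2}(N^{2}-1)r_1}{[12 + c^{2}(N^{2}-1)]^{2}}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{\frac{12 + c^{2}(N^{2}-1)}{12} \cdot \frac{12\,[12 - 12r_1^{2} + 2c^{2}(N^{2}-1) - 2c^{2}(N^{2}-1)r_1]}{[12 + c^{2}(N^{2}-1)]^{2}}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{\frac{12 - 12r_1^{2} + 2c^{2}(N^{2}-1) - 2c^{2}(N^{2}-1)r_1}{12 + c^{2}(N^{2}-1)}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{\frac{12\,(1 - r_1^{2}) + 2c^{2}(N^{2}-1)(1 - r_1)}{12 + c^{2}(N^{2}-1)}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{(1 - r_1^{2}) + \frac{c^{2}(N^{2}-1)(1 - r_1)^{2}}{12 + c^{2}(N^{2}-1)}}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{(1 - r_1^{2})\left[1 + \frac{c^{2}(N^{2}-1)(1 - r_1)}{[12 + c^{2}(N^{2}-1)](1 + r_1)}\right]}, \]

\[ e_N = .6745\,\sigma_1 \sqrt{1 - r_1^{2}} \cdot \sqrt{1 + \frac{c^{2}(N^{2}-1)(1 - r_1)}{[12 + c^{2}(N^{2}-1)](1 + r_1)}}. \]

But .6745 σ₁ √(1 − r₁²) = e₁.

Therefore,
\[ e_N = e_1 \sqrt{1 + \frac{c^{2}(N^{2}-1)(1 - r_1)}{[12 + c^{2}(N^{2}-1)](1 + r_1)}}. \]

This is formula (3) of the article.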






















THE VALIDATION OF INTELLIGENCE TESTS 
A. M. JORDAN 


University of North Carolina 
(Continued from September issue) 


Thus far little reference has been made to the correlations obtained 
by other investigators. It has seemed to the writer worth while to 
project his findings on the background of those of others. He has 
consequently gathered together a tentative bibliography composed of 
references which contribute in some way to the question of test 
validation. These articles have been analyzed and the results tabu- 
lated so that because of the large amount of work which has been 
done on some tests we can determine pretty surely just where we are 
in the meaning of correlations which will be computed. The tests 
will be taken up in turn and the inferences drawn concerning them. 


TABLE XIV.—DISTRIBUTION OF CORRELATIONS BETWEEN ALPHA AND HIGH
SCHOOL GRADES

   Army Alpha     FREQUENCY
      .19             2
      .21
      .23             2
      .25
      .27             2
      .29
      .31             1
      .33
      .35             3
      .37             5
      .39
      .41             3
      .43
      .45             2
      .47             2
      .49             2
      .51             2

   Number   26
   Median   .38
   Range    .19-.52


These correlations have been computed under a variety of condi- 
tions which appear in the diversity of results. It is evident that a 
correlation of .50 with high school grades is a high correlation; 7.e., 
it is high compared with other computed r’s with the same criterion. 



The coefficient of .48 obtained by me is well up toward the highest
obtained by others. Moreover, the average of six coefficients between
Alpha and English is .39; of eight between Alpha and history .33; of
five between Alpha and general science .44; and of five between Alpha
and mathematics .36.

When sub-tests are considered only in the case of Burtt and Arps
(12) are there comparative data. In their study we find the following,
which differ very decidedly from the writer's findings.

   Alpha 1    .06
         2    .14
         3    .27
         4    .27
         5    .26
         6    .16
         7    .18
         8    .37

Especially in the first four tests is this difference most apparent, for
while their average of these four is .185 that of this study is .48. Which
is the more typical only other investigators will discover. The correlations
of these sub-tests with mathematics, English, general science, and
history will have to wait for other investigators to be checked up and
compared.

Between Alpha and mental age five coefficients have been
discovered with an average of .72. The sub-test averages are given in the
following table:


TABLE XV.—CORRELATIONS OF THE SUB-TESTS OF ALPHA WITH MENTAL AGE

   SUB-TEST       AVERAGE OF r's     N
      1                .46           [illegible]
      2                .67           [illegible]
      3                .53           [illegible]
      4                .68           [illegible]
      5                .61           [illegible]
      6                .54           [illegible]
      7                .55           [illegible]
      8                .61           [illegible]
   Group test          .72           [illegible]


The correlation of .68 obtained in this study with mental age is 
seen to be somewhat lower than the average. There are some 
differences also among the sub-tests. 

The correlation of .61 with teachers’ estimates resembles the .50 
to .70 of the officers’ estimates and faculty ratings of .59. As far as 




the writer can discover no publication has been made of correlations 
of teachers’ estimates with sub-tests. 


The average correlations of Alpha with other tests are

   [test name illegible]                  .75 in 3 cases
   [test name illegible]                  .76 in 5 cases
   [test name illegible]                  .75 in 4 cases
   With Otis self-administering           .76 in 1 case
   With University of Minnesota Tests     .73 in 1 case
   With Haggerty, Delta 2                 .78 in 1 case
   With Myers Mental Measure              .35 in 1 case


The negative correlation of —.29 with age is matched with a —.29 
obtained by Madsen and Sylvester (35). 


TABLE XVI.—DISTRIBUTION OF CORRELATION COEFFICIENTS BETWEEN ALPHA
AND UNIVERSITY AND NORMAL SCHOOL MARKS

[The class intervals run from .22 to .66 in steps of .02. The frequency
column is illegible in the source except for the last interval, .66,
which has a frequency of 1.]

   Number   35
   Median   .415
   Range    .22-.67

The correlation of .48 obtained by the writer in a previous investigation is
slightly above the median of .415. Finally, for this test there have been
several correlations computed between sub-tests and University marks.










TABLE XVII.—CORRELATIONS OF SUB-TESTS OF ALPHA WITH MARKS OBTAINED IN
UNIVERSITY AND NORMAL SCHOOL

   SUB-TESTS    AVERAGE        NUMBER    RANGE
       1        [illegible]      11      -.01 to .34
       2           .27           12      -.10 to .39
       3           .17           11      -.09 to .43
       4           .32           12      -.16 to .55
       5           .25           12      -.15 to .45
       6           .21           11      -.01 to .40
       7           .29           12       .09 to .41
       8           .29           12       .15 to .50


Thus there is no consistency of correlations obtained. It would 
appear as if some investigators have (1) made errors in computation, 
or (2) have had groups differing widely in homogeneity, or (3) from 
some other reason or combinations of reasons have procured results 
diverging widely from the average. Consider a negative correlation 
of .15 with Test 4 and grades. Test 4 is composed of opposites. The 
average of twelve correlations with grades in the case of this test is .32 
and yet one investigator obtains a coefficient of —.15. The preponder- 
ance of evidence is against the correctness of such findings. 

This test (Army Alpha) has good reliability, being above .90. 

Otis Group Test.—The Otis group test has been used very widely in 
the high schools and furnishes abundant data for purposes of com- 
parison. The following table sets forth the correlations from grade IV 
through the high school. 


TABLE XVIII.—AVERAGE CORRELATIONS OF OTIS TEST WITH SCHOOL MARKS IN
VARIOUS GRADES (LARGELY FROM COLVIN 14)

   Grade    Average       Number       |  Grade                Average       Number
   IVA       .78            2          |  VIIIA                 .73            3
   IV        .62            5          |  VIII                  .68            7
   VA        .56            2          |  IXA                  [illegible]   [illegible]
   V         .69            5          |  IX                   [illegible]   [illegible]
   VIA       .61          [illegible]  |  X                     .36            1
   VI       [illegible]     5          |  XI and XII            .42            1
   VIIA      .65            4          |  Whole High School     .44            6
   VII       .71            5          |

Average Coefficients Grades IV-VIII .66 in 40 cases; Range .33-.91.
Average all high schools .49 in sixteen cases; Range .31-.82.







This average correlation of .49 is not much different from .45 obtained 
by the writer in another investigation (31). This may well be com- 
pared with an average of .60 with grades when form A and form B of 
this group test are both used. The correlations with grades in normal 
schools in three cases averaged .28 but the groups were quite homo- 
geneous. With the Stanford-Binet the Otis test has a fairly high 
correlation. The average for grades V to VIII was in six cases .66 with 
a range of .46-.76; for the high school .63 with a range of .44 to .73; 
while, when grades V to XII were included the correlation was .80 in 
one case. This average correlation of .63 in the high school cor- 
responds very closely with the .66 of this investigation. There are 
few or no publications of correlations of sub-tests with Stanford- 
Binet or with any other individual test. Otis correlates negatively 
with age. The three coefficients found have an average of — .43 
with a range of — .41 to — .45 giving substantiating evidence to the 
excellence of the test in this particular. 

Among those who have computed correlations with other tests 
probably Franzen (21) has done as much as anyone else. Many of 
the correlations below are taken from his findings. 


TABLE XIX.—CORRELATIONS OF OTIS TEST WITH OTHER TESTS. NUMBERS IN
PARENTHESES INDICATE THE NUMBER OF r's ENTERING INTO THE AVERAGE
(Data largely from Franzen)

                    [illegible]  DELTA    ALPHA    MILLER   TERMAN   NATIONAL A   MENTIMETER
                      .66(6)      .82     .78(2)   .76(3)   .78(4)     .76(3)       .63(3)
Reading constant        ..        ..       ..       ..       ..         .81          .72

                    THURSTONE   HAGGERTY   ILLINOIS   PRESSEY   THORNDIKE READING   NATIONAL B
                       .60         .78       .90        .37           .83              .73
Reading constant       ..          .39       .67        .35           ..               .65

                    DEARBORN-1   DEARBORN-2   PRESSEY XOUT   WYLIE   MYERS   UNIV. OF MINN.
                       .60          .55           .74         .65     .56        .56
Reading constant       .56          .44           .69         .30     .65        ..


Thus it is seen that correlations of this test with other tests called 
by the same name and purporting to measure the same thing, range 
all the way from .37 to .90. If only we could discover which one really 
measured intelligence the rest would be easy. Correlations of Otis
tests with three separate composites reveal interesting results. Otis
correlated with a composite of Otis, Alpha, Miller, and Terman gives 
a correlation coefficient of .93; this test correlated with a composite 
of National A and B, Haggerty Delta 2, Otis, Myers, and Kelly- 
Trabue Language gives .68; and Otis correlated with the average of 
Terman, Otis, Thurstone, and Rogers gives an r of .97. 

The correlations obtained with teachers’ estimates of intelligence 
and the learning test have no other coefficients with which they might 
be compared. 

Terman Group Test—The Terman group test of intelligence has 
not been on the market as long as the preceding two but still has been 
fairly widely used. With school marks the correlations are quite 
similar to those of the preceding tests. 

With marks the average of correlations is .47 in nine cases, with 
a range of .30 to .67. Hines (26) obtained results quite discrepant from
these, his coefficients averaging only .24 in nine cases. The writer is 
inclined to think that Hines’s results are not representative. The 
correlation of this test with marks in Latin is .65 in one study; with 
algebra .50 in one case; with history .40 in five cases; with mathematics 
.26 in thirteen cases with a range of —.18 to .44; with science .14 in 
one case; and with general science .54 in four cases. The correlations 
of the sub-tests with grades have no other coefficients with which they 
could be compared. However, Professor Terman sent me the follow- 
ing correlations with corrected grade location which, according to him, 
is the best measure of “educated” that we have. These correlations
were made with sub-tests which were slightly longer than those which 
now compose the test. 


TERMAN SUB-TEST    CORRECTED GRADE LOCATION
       1                  .597
       2                  .554
       3                  .663
       4                  .596
       5                  .48
       6                  .644
       7                  .64
       8                  .58
       9                  .54
      10                  .373

Therefore this test shows a high correlation with the amount of school 
knowledge which individuals have acquired. This corroborates to a
certain extent the writer’s findings (31) that “For all subjects combined
Terman stands above the rest because of Test 1 with a coefficient
of .555 and because the correlation between the group tests and all 
subjects is .492.” 

The correlation of the tests as a whole with the Stanford-Binet 
ranges from .35 to .75 with an average at .64 which is somewhat below 
the present finding of .68. 

Here again the reliability is high. 

In collecting correlations with other tests the writer found the work 
of Franzen (21) a storehouse of information. 


TABLE XX.—CORRELATIONS OF TERMAN WITH OTHER TESTS. THE NUMBERS IN
PARENTHESES REPRESENT THE NUMBER OF r's ENTERING INTO THE AVERAGE
(Data largely from Franzen)

                    COMPOSITE     NATIONAL A   OTIS     HAGGERTY   ILLINOIS   PRESSEY SURVEY
                       .92           .85       .80(5)     .85        .83          .82
Reading constant   [illegible]       .68       .67        .68        .62          .70

                    MENTIMETER   NATIONAL B   DEARBORN-1   DEARBORN-2   PRESSEY XOUT
                       .89          .76          .68          .70           .58
Reading constant       .92          .66          .64          .67           .38

                    MYERS    MILLER   ALPHA    U. OF MINN.   THURSTONE   ROGERS   CHICAGO
                    .55(2)   .73(4)   .75(4)      .64           .86        .61     .74(2)
Reading constant     .42      .58      ..         ..            ..         ..       ..


The correlations with individual tests range from .55-.89 with an 
average at .72 which should be compared with the coefficients obtained 
in this study of .79 with Miller, of .71 with Alpha; and of .78 with Otis. 

With age the correlation is negative but not quite so largely 
negative as was Otis, the average of five coefficients being here —.33. 

Miller Group Test.—This group test has been off the press such a
short time that comparisons can hardly be made. I am indebted to its
maker for several correlations which I shall give. The correlation
with school marks is .56 and with a grammar test .39. The average
correlation found by the writer in a previous study (31) is .48.

With teachers’ estimates, Stanford-Binet, and the learning test 
there are no coefficients for comparative purposes. Particularly inter- 
esting would be the correlations with Stanford-Binet with this test 
since the Miller test ranked lowest with this criterion. 

There are some interesting correlations with other tests. 




TABLE XXI.—CORRELATIONS OF MILLER GROUP TEST WITH OTHER INTELLIGENCE
TESTS
(Data furnished largely by Miller)
Numbers in parentheses show the number of r's entering into the average

UNIV. OF MINN.   MENTIMETER   MYERS    OTIS     TERMAN   DELTA 2   ALPHA    THORNDIKE
    .72(2)         .69(3)     .32(2)   .76(4)   .73(4)   .78(1)    .75(4)     .82

Thus the correlations range from .32 to .82 with the average at .69.
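The range and average just quoted can be checked from the tabled coefficients. The following is only a minimal sketch: the variable and function names are illustrative, the Myers entry is read as .32 on the strength of the stated range, and an unweighted mean is assumed because the article does not say how the average was taken.

```python
from statistics import mean

# Coefficients as read from the table above. The Myers entry is taken as .32
# (an assumption based on the range stated in the text).
miller_r = {
    "Univ. of Minn.": 0.72,
    "Mentimeter": 0.69,
    "Myers": 0.32,
    "Otis": 0.76,
    "Terman": 0.73,
    "Delta 2": 0.78,
    "Alpha": 0.75,
    "Thorndike": 0.82,
}


def summarize(coeffs):
    """Return (lowest, highest, unweighted mean) of a dict of correlations."""
    values = list(coeffs.values())
    return min(values), max(values), mean(values)


low, high, avg = summarize(miller_r)
print(f"range {low:.2f} to {high:.2f}, unweighted mean {avg:.3f}")
```

The unweighted mean comes out close to the reported .69; a small difference in the last place may reflect rounding or a weighting by the number of r's behind each entry.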


When composites are made Miller stands well. The coefficient 
of correlation of Miller with a composite of Miller, Terman, Alpha, and 
Otis is .90; with a composite of Miller, Delta-2, Terman A, Alpha, and 
Mentimeter is .90; while a composite of Univ. of Minn., Mentimeter, 
Thurstone IV, Myers Mental Measure, Otis, Terman, Miller A and B 
has a correlation of .88 with Miller B and .85 for Miller A. 

The reliability coefficient ranges from .86 to .91 with an average
at .90.


SUMMARY 


The summary will be given in order according to the plan already laid 
down on pages 350 and 351. 

I. (1) The correlations of the four group tests (Alpha, Miller, Terman,
and Otis) with mental age (Stanford-Binet) are fairly high. Three
of them hover around .68 while the fourth (Miller) has a correlation
considerably lower (.53). Of the sub-tests, Alpha-4 (Opposites)
has the highest coefficient (.61) but several others stand well in this
respect. Some of these latter are: Otis-2 (Opposites) .59; Otis-4
(Proverbs) .57; and Terman-3 (Opposites) .57. It is noteworthy
that each of the sub-tests mentioned has a higher correlation with
mental age than the Miller group test.

2. With age the correlations in all group tests and with all sub-tests
save one are negative. The one exception is Terman-9 (Classification).
This substantiates the usual dictum that in not too large a
range the younger pupils are on an average the brighter. The highest
negative correlation is -.455 with Otis-9 (Story Completion), this
being even higher than the correlations with age of any of the group
tests taken as a whole.

3. With average grades the correlations of the four group tests are 
grouped around .47. The lowest is .45 (Otis) the highest is .49 (Ter- 
man). Among sub-tests Terman-1 (Information) has the highest 
correlation (.55) while Otis-10 (memory) has the lowest (.14). When
individual subjects in the high school curriculum are considered we
find that Miller correlates highest with English (.56) while Otis has the
lowest coefficient (.47). Moreover among the sub-tests Miller-1
(Mixed Sentences) correlates highest of all (.59). The highest and
lowest coefficients in other subjects are: Mathematics: highest Otis-5
(Arithmetic Problems, .68), lowest Otis-3 (Disarranged Sentences,
.035); General Science: highest Terman Group, .64, lowest Otis-8
(Similarities, .10); and History: highest Terman-6 (Sentence Meaning,
.59), lowest Otis-1 (Hard Written Directions, -.21).

4. The coefficients of correlation of the intelligence tests with
teachers’ estimates are unusually high, no group test being below .60
and one (Otis) having the high correlation of .73. The sub-tests also
show in general significant correlations with this criterion, ranging from
.24 in Alpha-1 (Hard Oral Directions) to .63 with Terman-3 (Opposites)
and .62 with Alpha-2 (Arithmetic Problems). It may be observed
that, as far as teachers’ estimates of intelligence are concerned, either
of the last two mentioned is as good as Army Alpha taken as a whole.

5. The correlations with learning test are uniformly low. The 
coefficients in general are grouped around .20, the highest being .31 
with Alpha-2 (Arithmetic) and the lowest —.125 with Terman-8 (Mixed 
Sentences). Among the tests taken in groups Otis with .23 is highest
and Miller with .17 is lowest, although “high” and “low” here have
little significance.

6. When a composite of the four tests is made and this used as a 
criterion, and correlations made with it, the coefficients in general are 
unusually high. The group tests correlate around .90 with little to 
choose between them while among the sub-tests Terman-3 (Opposites) 
has the unusually high correlation of .83, and Alpha-1 (Hard Oral 
Directions) has the lowest of all, .395. 

II. In all cases but one (grades) the factor of age was made constant 
by means of the coefficient of partial correlation. The result of this 
statistical treatment was to decrease the coefficient in practically all 
cases from .01 to .06. 
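The effect of holding age constant can be illustrated with the standard first-order partial correlation formula, r12.3 = (r12 - r13 r23) / sqrt((1 - r13^2)(1 - r23^2)). The sketch below is not the writer's computation: the function name and the three input coefficients are hypothetical, chosen only to resemble the magnitudes reported in this study.

```python
import math


def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, variable 3 held constant."""
    denom = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    if denom == 0:
        raise ValueError("partial correlation undefined when r13 or r23 is +/-1")
    return (r12 - r13 * r23) / denom


# Hypothetical illustration (values are NOT taken from the article):
r_test_criterion = 0.60   # test with some criterion
r_test_age = -0.30        # test with chronological age
r_criterion_age = -0.20   # criterion with chronological age

adjusted = partial_r(r_test_criterion, r_test_age, r_criterion_age)
print(f"raw r = {r_test_criterion:.2f}, age constant r = {adjusted:.3f}")
```

With these assumed inputs the coefficient falls from .60 to about .58, a drop of the same order as the .01 to .06 reported above.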

III. (1) The average displacements from corresponding thirds when 
two tests were compared ranged from .28 to .47 per cent depending 
upon the degree of similarity between the tests, Otis and Miller having 
the smallest displacement and Terman and Alpha the largest. 

2. When correlations in the regression equation were assumed to 
be perfect and transmutations were made from the three tests to the 
fourth and the average differences computed the largest average difference
was in the case of Terman, 15.4 units; the smallest with Miller,
8.6. However, when each of these averages is divided by the SD’s of 
the respective tests Otis has the smallest, 2.45; Miller next 2.46; 
Terman next 2.59; and Army last 2.63. 
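When the correlation in the regression equation y' = M_y + r (SD_y / SD_x)(x - M_x) is taken as perfect (r = 1), the transmutation reduces to matching standard scores. The following sketch shows that reduction and one way of expressing the resulting average difference relative to the SD. The function names are hypothetical, and the exact procedure of the study (including the direction of the ratio to the SD) cannot be recovered from this section, so this is one plausible reading only.

```python
from statistics import mean, pstdev


def transmute(score, mean_x, sd_x, mean_y, sd_y, r=1.0):
    """Regression estimate of a score on test Y from a score on test X.

    With r = 1.0 (the assumption named in the text) this is simply
    equating the two tests in standard-score units.
    """
    return mean_y + r * (sd_y / sd_x) * (score - mean_x)


def average_difference(x_scores, y_scores):
    """Mean absolute gap between transmuted X scores and actual Y scores.

    Returns the gap in raw Y units and the same gap divided by SD of Y.
    """
    mx, my = mean(x_scores), mean(y_scores)
    sx, sy = pstdev(x_scores), pstdev(y_scores)
    gaps = [abs(transmute(x, mx, sx, my, sy) - y)
            for x, y in zip(x_scores, y_scores)]
    avg_gap = mean(gaps)
    return avg_gap, avg_gap / sy


# Hypothetical scores for five pupils on two tests (not from the article).
test_x = [40, 55, 60, 72, 90]
test_y = [95, 120, 118, 150, 170]
raw_gap, gap_in_sd = average_difference(test_x, test_y)
print(f"average difference {raw_gap:.1f} units, {gap_in_sd:.2f} SD")
```

Dividing by the SD puts tests with different score scales on a common footing, which is why the ordering of the four tests changes once the raw averages are so divided.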

IV. The average of the collected correlations of the tests with various 
criteria is quite similar to the findings already reported. These are 
Alpha .38 (high school), .42 (University), Otis .44, Terman .47, and 
Miller .56. In the case of mental age Alpha stands first with .72, 
Terman next .64, and Otis .63. I discovered no correlation of Miller 
with mental age. With teachers’ estimates only Alpha, ranging from 
.50 to .70, could be found. With age the two discovered are quite 
similar to the results obtained in this study, Otis —.43 and Alpha —.29. 
There are correlations made also with various composites but no clear 
cut inferences can be drawn, and finally correlations have been made 
with other tests with varying results. 

V. A variety of studies has been made, using the tests discussed
here. Sometimes merely tabulated results of the scores have
been given; sometimes rather elaborate statistical procedures have
been used with some interpretation of results. Only the latter have been
included in the present bibliography and, among these, all have been
omitted which neglected to compare the tests with
some criterion. In general there has appeared a feeling of disappointment.
In one extreme case so strong a word as “contra-indicated”
is used by Bridges in reference to the helpfulness of the Alpha tests
in foretelling success in college subjects. On the other hand
some (among them Colvin) have shown in a convincing way the value
of the tests in a variety of relations.

VI. In considering sub-tests no other test stands so high with all
criteria as Opposites. Arithmetic Reasoning stands next to Opposites
as a useful sub-test for a group test of intelligence; then come, in order,
Geometric Figures and Proverbs.

VII. As far as our data go, Otis is the best all round test for testing 
intelligence at the high school age; Terman ranks next; Alpha third, and 
Miller fourth. Terman and Alpha are so close together that they 
almost tie for the second place. This inference assumes that prac- 
tically all the criteria are of equal value. But weighing them almost 
as you choose Otis would be ahead. 

VIII. Discussion and Conclusion.—If we ask now what do tests 
measure this investigation can not answer definitely. Certain findings 
however, do throw light on this question. Each of the four tests used
is related to the several criteria used in varying degrees of closeness.
We now know that “capacity to learn” is not a good definition,
because it has been demonstrated that in at least one simple
case of ideational learning there was only slight correlation with the
test. My opinion is that intelligence comes more nearly to being the
capacity to learn when the material learned is difficult for the learner.
The “capacity to learn” includes too much.

The tests used do differ from each other but only in one case
(mental age) widely enough to be significant. Apparently large
determining differences turn out to be .03 to .07, which might be changed
in another investigation. This, in a multitude of discouraging findings,
is a hopeful sign. If one knew the weight to be attached to the
various criteria, and if they could be perfectly combined, perhaps one
composite criterion might be used with which tests would correlate
much more highly than they do with any of them taken singly.

Now for the various criteria used. Mental age is limited in two
ways: (1) It limits the possible IQ of some at thirteen and fourteen,
and of many at fifteen to sixteen and older. After thirteen or fourteen
years, if an unusually bright child should be measured from year to
year, his mental age would remain constant while his chronological
age was increasing, thus causing a lowering of the IQ. This limits the
usefulness of this instrument. It seems in some cases a poorer instrument
than the group test, for the IQ correlates only .57 with teachers’
estimates, .70 with composite, and -.16 with age, in each case lower
than any one of the group tests. (2) The tests themselves are limited
in depending too much on reproduction of haphazard numbers and
definitions of words. One glaring error appears in putting in the
question of distinguishing between “character” and “reputation,” for
at least among those persons that I have tested this difference has been
learned by heart either in the copy book or in some other place.
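The first limitation above follows directly from the definition of the intelligence quotient as 100 times mental age divided by chronological age. A minimal sketch, with a hypothetical mental age that has reached the top of the scale, shows the quotient falling as the child grows older; the figure of 18 years and the function name are assumptions made purely for illustration.

```python
def iq(mental_age, chronological_age):
    """Intelligence quotient: 100 * mental age / chronological age (ages in years)."""
    return 100.0 * mental_age / chronological_age


# Hypothetical bright child whose measured mental age can rise no further.
CEILING_MENTAL_AGE = 18.0  # assumed for illustration, not a figure from the article

quotients = {age: iq(CEILING_MENTAL_AGE, age) for age in (13, 14, 15, 16)}
for age, value in quotients.items():
    print(f"chronological age {age}: IQ {value:.1f}")
```

The measured quotient declines year by year although nothing about the child has changed except age, which is the lowering of the IQ described in (1).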

In teachers’ estimates, at least in this study, there appears the most 
important extra-test criterion. The estimates were so carefully made 
by teachers of mature opinions who had been trained to look for evi- 
dences of intelligence that the average results can be taken almost at 
face value. Those tests, therefore, which correlate highly with this 
criterion have in my opinion much in their favor. 

The learning test was a distinct disappointment. That it seems 
difficult enough may be determined by attempting to work out the 
letter occurring midway between the twelfth and eighteenth. Moreover
it appeared that twelve practice periods of three minutes each
would be time enough for considerable learning and that the average of
the first three subtracted from the average of the last three would be
at least “capacity to learn,” and yet the correlations with it were
exceedingly low. It would be interesting to try correlations with the 
learning of more difficult material. 
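The learning score described above, the average of the last three practice periods less the average of the first three, can be sketched as follows. The function name and the twelve period scores are hypothetical.

```python
from statistics import mean


def learning_gain(period_scores, k=3):
    """Average of the last k periods minus average of the first k periods."""
    if len(period_scores) < 2 * k:
        raise ValueError("need at least 2*k practice periods")
    return mean(period_scores[-k:]) - mean(period_scores[:k])


# Hypothetical scores for one pupil over twelve three-minute periods.
scores = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print(learning_gain(scores))
```

Averaging three periods at each end, rather than taking the first and last single periods, damps chance fluctuation in any one period.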

Age seems an important enough factor, but it is not to be depended
upon too much, since with a wider range the correlations become
positive and, at present, just the width of range at which the
change takes place is not known.

Composites always seem to me like loading the dice before the 
throw is made and are much poorer validating material than any of 
the other criteria. 
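The objection to composites can be made concrete. If k tests have equal variances and a common intercorrelation rho (both simplifying assumptions introduced here, not conditions stated in the article), the correlation of any one test with the unweighted sum that includes it is sqrt((1 + (k - 1) rho) / k), which exceeds its correlation with the sum of the other tests alone. The function name below is hypothetical.

```python
import math


def r_with_composite(k, rho, include_self=True):
    """Correlation of one standardized test with an unweighted sum of tests.

    Assumes k tests of equal variance with a common intercorrelation rho.
    include_self=True: the composite contains the test itself (k tests).
    include_self=False: the composite is the sum of the other k - 1 tests.
    """
    if k < 2:
        raise ValueError("need at least two tests")
    if include_self:
        covariance = 1 + (k - 1) * rho
        variance = k + k * (k - 1) * rho
    else:
        covariance = (k - 1) * rho
        variance = (k - 1) + (k - 1) * (k - 2) * rho
    return covariance / math.sqrt(variance)


# Four tests, as in this study; rho = .70 is an assumed round figure.
with_self = r_with_composite(4, 0.70, include_self=True)
without_self = r_with_composite(4, 0.70, include_self=False)
print(f"included in composite: {with_self:.2f}; excluded: {without_self:.2f}")
```

Under these assumptions a test correlates about .88 with a four-test composite containing it, against about .78 with the other three combined, and even four wholly uncorrelated tests would each correlate .50 with their own sum. Part of any coefficient near .90 with such a composite is therefore built in.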

Grades are possibly next in importance to mental age. They have 
the advantage of extending over long periods of time and of eliminating 
ups and downs due to chance variations but are so complicated with 
other factors that intelligence is an unknown factor. 

The most hopeless results obtained were those concerned with 
variations in scores between two tests that have high correlations 
with each other. The only hope of ever improving our results that 
seems feasible is to throw out all cases of unusual variations and test 
them individually. One might lay down this safe dictum, that if the 
individual’s scores on various tests are within 2 PE of the average 
variation of tests from each other (see page 353) then their scores 
should be thrown out and they be tested further. 


RECOMMENDATIONS 


It is recommended that the National Research Council be prevailed 
upon to undertake a great testing program in order to perfect a 
standard series of group tests of intelligence suitable for testing high 
school pupils and adults. 


SELECTED BIBLIOGRAPHY BEARING ON CORRELATIONS 


1. Almack, John C. and Almack, James L.: Gifted Pupils in the High School.
School and Society, Vol. XIV, 1921, pp. 227-228 (Alpha). 

2. Anderson, John E.: Intelligence Tests of Yale Freshmen. School and
Society, Vol. XI, 1920, pp. 417-420 (Alpha).

3. Angier, Roswell, P.: Yale Freshmen Intelligence Tests. Yale Alumni 
Weekly, Feb. 6, 1920, pp. 454-455 (Alpha). 

4. Arps, Geo. F.: Intelligence Tests and Their Applications. Proceedings 
First Annual Educational Conference Ohio State University, pp. 25-31 (Alpha). 




5. Arthur, Grace and Woodrow, H.: An Absolute Intelligence Scale; a Study 
in Method. Journal Applied Psychology, Vol. III, 1919, pp. 118-137 (Sub-tests). 

6. Benson, C. E.: Results of Army Alpha in a Teachers’ Training Institution. 
Journal Educational Administration and Supervision, Vol. VII, 1921, pp. 348-349 
(Alpha). 

7. Breed, F. S.: Shall We Classify Pupils by Intelligence Tests. School and 
Society, Vol. XV, 1922, pp. 406-409 (General). 

8. Breed, F. S. and Breslich, E. R.: Intelligence and the Classification of 
Pupils. School Review, Vol. XXX, 1922, pp. 51-66, 210-226 (Otis, Terman). 

9. Bridges, J. W.: The Correlation between College Grades and Alpha 
Intelligence Tests. Journal Educational Psychology, Vol. XI, 1920, pp. 361-367 
(Alpha). 

10. Bridges, J. W.: The Value of Intelligence Tests in Universities. School 
and Society, Vol. XV, 1922, pp. 295-303 (Alpha). 

11. Bright, Ira J.: The Intelligence Examination for High School Freshmen. 
Journal Educational Research, Vol. IV, 1921, pp. 44-55 (Terman). 

12. Burtt, H. E. and Arps, G. F.: The Correlation of Army Alpha Tests with 
Academic Grades in High Schools and Military Academies. Journal Applied 
Psychology, Vol. IV, 1920, pp. 289-293 (Alpha). 

13. Coddington, E. F.: Correlation between Army Intelligence Tests and 
College Records. Engineering Education, Vol. XI, 1921, pp. 311-318 (Alpha). 

14. Colvin, S. S.: Some Recent Results Obtained from Otis Group Intelligence
Scale. Journal Educational Research, Vol. III, 1921, pp. 1-12 (Otis).

15. Colvin, S. S.: Psychological Tests at Brown University. School and
Society, Vol. X, 1919, pp. 27-30 (Alpha). 

16. Colvin, S. S.: Validity of Psychological Tests for College Entrance. 
Educational Review, Vol. LX, 1920, pp. 7-17 (General). 

17. Colvin, S. S. and MacPhail, A. H.: The Value of Psychological Tests at
Brown University. School and Society, Vol. XVI, 1922, pp. 113-122 (Alpha). 

18. DeCamp, J. E.: Studies in Mental Tests. School and Society, Vol. XIV, 
1921, pp. 254-258 (Alpha). 

19. Davis, Homer: Army Alpha and Students’ Grades Illustrating the Value 
of the Regression Equation. School and Society, Vol. XIV, 1921, pp. 223-227 
(Alpha). 

20. Dickson, V. E. and Norton, J. K.: Otis Group Intelligence Test Applied to 
Elementary School Graduating Classes. Journal Educational Research, Vol. III, 
1921, pp. 106-115 (Otis). 

21. Franzen, Raymond: Attempts at Test Validation. Journal Educational 
Research, Vol. IV, 1922, pp. 145-158 (Otis and Terman). 

22. Freeman, F. N.: Mental Tests. Psychological Bulletin, Vol. XVII, 1920, 
pp. 352-362 (Bibliography). 

23. Gambrill, Bessie Lee: Some Administrative Uses of Intelligence Tests in a 
Normal School. Twenty-first Yearbook National Society for Study of Education, 
1922, pp. 223-243 (Otis and Terman). 

24. Garrison, S. C. and Tippett, J. S.: Comparison of Binet-Simon and Otis
Test. Journal Educational Research, Vol. VI, 1922, pp. 42-48 (Otis). 

25. Gates, A. I.: The Correlations of Achievement in School Subjects with 
Intelligence Tests and Other Variables. Journal Educational Psychology, Vol. 
XIII, 1922, pp. 129-139, 223-235, 277-285 (Otis and Terman). 










26. Hines, Harlan C.: A Program for Lowering the Percentage of Failures. 
School and Society, Vol. XIII, 1921, pp. 582-584 (Terman). 

27. Hoke, Elmer: Intelligence Tests and College Success. Journal Educational 
Research, Vol. VI, 1922, p. 177 (Otis). 

28. Holley, C. E.: Mental Tests for School Use. University of Illinois Bulletin, 
No. 28 (Otis). 

29. Hunter, H. T.: Intelligence Tests at Southern Methodist University. 
School and Society, Vol. X, 1919, pp. 437-440 (Alpha). 

30. Jordan, A. M.: Some Results and Correlations of Army Alpha. School 
and Society, Vol. VII, 1920, pp. 354-358 (Alpha). 

31. Jordan, A. M.: Correlations of Four Intelligence Tests with Grades. 
Journal Educational Psychology, Vol. XIII, 1922, pp. 419-429 (Alpha, Miller, 
Terman, and Otis). 

32. Lippmann, Walter: Discussion of Intelligence Tests. New Republic, Vol.
XXXII, pp. 213-215, 246-248, 275-277, 297-298, 328-330, Vol. XXXIII, 1922,
pp. 9-11 (Alpha).

33. Madsen, I. N.: Group Intelligence Tests as a Means of Prognosis in High 
School. Journal Educational Research, Vol. III, 1921, pp. 43-52 (Alpha). 

34. Madsen, I. N.: Army Intelligence Tests as a Means of Prognosis in High 
School. School and Society, Vol. XI, 1920, pp. 625-627 (Alpha). 

35. Madsen, I. N. and Sylvester, R. H.: High School Students Intelligence 
Ratings According to Army Alpha Test. School and Society, Vol. X, 1919, pp. 
407-410 (Alpha). 

36. McGeoch, John A.: Some Results from Three Group Intelligence Tests.
School and Society, Vol. XVIII, 1923, p. 196 (Alpha).

37. Miller, W. S.: Administrative Uses of Intelligence Tests in the High
School. Twenty-first Yearbook of the National Society for the Study of Education,
1922, pp. 189-222 (Miller).

38. Moore, H. T.: Three Types of Psychological Rating in Use with Freshmen
at Dartmouth. School and Society, Vol. XIII, 1921, pp. 418-420 (Alpha).

39. Peterson, H. A. and Kuderna, J. G.: Army Alpha in Normal Schools. 
School and Society, Vol. XIII, 1921, pp. 476-480 (Alpha). 

40. Proctor, W. M.: Psychological Tests and Guidance of High School Pupils. 
Journal Educational Research Monographs No. I, June, 1921 (Alpha). 

41. Proctor, W. M.: Psychological Tests as a Means of Measuring the Probable 
School Success of High School Pupils. Journal Educational Research, Vol. I, 1920, 
pp. 258-270 (Alpha). 

42. Proctor, W. M.: The Use of Tests in Educational Guidance of High School
Pupils. Journal Educational Research, Vol. I, 1920, pp. 369-381 (Alpha).

43. Psychological Examining in the U. S. Army. Memoirs of the National 
Academy of Sciences, Vol. XV, 1921 (Alpha). 

44. Root, W. T.: Correlations between Binet and Group Tests. Journal 
Educational Psychology, Vol. XIII, 1922, pp. 286-292 (Otis, Terman). 

45. Ruch, G. M. and Strachan, Lexie: Intelligence Ratings by Group Scales 
and by the Stanford Revision of the Binet Tests. Journal Educational Psychology, 
Vol. XI, 1920, pp. 421-429 (Alpha and Otis). 

46. Smith, W. H.: Otis Group Intelligence Test and High School Grades. 
School and Society, Vol. XII, 1920, pp. 71-72 (Otis). 




47. Stenquist, John L.: Unreliability of Individual Scores in Mental Measure- 
ment. Journal Educational Research, Vol. IV, 1921, pp. 347-354 (Otis). 

48. Stenquist, John L.: The Case for the Low IQ. Journal Educational 
Research, Vol. IV, 1921, pp. 241-254 (Otis). 

49. Stevenson, Dwight H.: The Correlation between Intelligence Ratings and 
School Marks in Country Normal School Pupils. Ohio State University Educational 
Research Bulletin, Vol. I, No. 14, Sept. 13, 1922 (Terman). 

50. Terman, L. M.: The Great Conspiracy. New Republic, Vol. XXXIII, 
1923, pp. 116-120 (General). 

51. Terman, L. M.: Intelligence Tests in Colleges and Universities. School 
and Society, Vol. XIII, 1921, pp. 481-494 (Terman and Alpha). 

52. Thorndike, E. L.: Organization of the Intellect. Psychology Review, Vol. 
XXVIII, 1921, pp. 141-151 (Alpha). 

53. Trabue, M. R.: Use of Intelligence Tests in Junior High Schools. Twenty 
First Yearbook National Society for Study of Education, 1922, pp. 169-188 (Otis). 

54. Valentine, R. E.: A Study in Intelligence and Educational Correlations. 
Journal Educational Research, Vol. III, 1921, pp. 207 (Otis). 

55. Van Wagenen, M. J.: Some Results and Inferences Derived from the 
Army Tests of the University of Minnesota. Journal Applied Psychology, Vol. IV, 
1921, pp. 59-72 (Alpha). 

56. Webb, L. W.: Ability in Mental Tests in Relation to Reading Ability. 
School and Society, Vol. XI, 1920, pp. 567-570 (Alpha). 

57. West, R. L.: An Experiment with the Otis Group Intelligence Scale in 
Needham, Mass. Journal Educational Research, Vol. III, 1921, pp. 261-272 (Otis). 

58. Wentworth, Mary M.: Army Alpha Tests and Teachers’ Estimates in 
Hollywood High School. School and Society, Vol. XII, 1920, pp. 58-60 (Alpha). 

59. Willard, Dudley W.: Native and Acquired Mental Ability as Measured by 
the Terman Group Test of Mental Ability. School and Society, Vol. XVI, 1922, 
pp. 750-756 (Terman). 

60. Wilson, W. R.: Mental Tests and College Teaching. School and Society, 
Vol. XVI, 1922, pp. 629-635 (Alpha). 

61. Wolcott, Gregory D.: Mental Testing at Hamline University. School 
and Society, Vol. X, 1919, pp. 57-60 (Alpha). 

62. Yoakum, C. S. and Yerkes, R. M.: “Army Mental Tests.” Henry Holt 
and Co., 1920 (Alpha). 










PREDICTING ACADEMIC SUCCESS 


MARK A. MAY 
Syracuse University 


If all the factors that contribute to success in college were known so 
that intercorrelations could be obtained between them and also 
between each factor and some measure of success, a formula could 
then be written for predicting success. The general form of this 
formula would be: 


$$\text{Success} = X_1 A + X_2 B + X_3 C + \cdots + X_n N + K$$


in which $X_1, X_2, X_3, \ldots, X_n$ are the factors and $A, B, C, \ldots, N$ 
are a system of weights so arranged as to yield the maximum correlation 
between the right and left hand sides of the equation. Thus to predict 
the success of a given individual all we need do is to substitute his 
score in each of the factors for $X_1$, $X_2$, $X_3$, etc., and multiply each by its 
appropriate weight. The precision or exactness of the prediction will 
depend partly on the size of the correlation between the right and left 
hand sides of the equation (i.e., the correlation between actual success 
and predicted success in a large unselected sample) and partly on the 
size of the standard deviation of the measure of success. For the 
standard deviation of the distribution of errors is given by the formula 
$\sigma_s\sqrt{1 - R^2}$, in which $\sigma_s$ is the standard deviation of the measures of 
success, and $R$ is the correlation between the actual measures of success 
and the predicted values. It is obvious that if R is large, say .90 or 
more, it does not matter how large $\sigma_s$ is, for the whole thing approaches 
zero. But if R is as low as .80 it makes considerable difference how 
large $\sigma_s$ is. 
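
As a rough numerical illustration of the error formula just given, the following Python sketch tabulates the standard deviation of the errors of prediction for a few values of R. The function name and the value of 10 used for the standard deviation of the success measure are assumptions made for illustration only; they are not figures from this study.

```python
import math

def error_sd(sigma_s: float, r: float) -> float:
    """Standard deviation of the errors of prediction: sigma_s * sqrt(1 - R^2)."""
    return sigma_s * math.sqrt(1.0 - r * r)

# Illustrative only: sigma_s = 10 is an assumed value, not one from the study.
for r in (0.60, 0.80, 0.90, 0.95):
    print(f"R = {r:.2f}   error SD = {error_sd(10.0, r):.2f}")
```

The printed column shows how quickly the error shrinks as R rises, for a fixed spread in the measure of success.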

This is merely a brief statement of the general mathematical 
principles on which such predictions depend; these principles are well 
known to those acquainted with statistical methods. Our main problem is that 
of defining and measuring academic success and of discovering and 
measuring the elements that compose it. 

What is academic success? In general, it is intellectual achieve- 
ment, leadership in student affairs, development of a proper view of 
life, choosing a vocation, building character, etc., etc. For the present, 
at least, and until such a time as authorities may agree on a definition 
we must narrow it down to something definite and concrete. For the 
purposes of this investigation we shall define academic success as 
intellectual achievement and assume that it is measured by college 
grades or marks. Or more precisely, we shall assume that the quality 
of intellectual work is measured by marks and that quantity is 
measured by semester hours. The only defense for such assumptions 
is that these are our only available means of measurement. 

The purpose of this study is to ascertain how accurately the aca- 
demic success of 450 Liberal Arts freshmen could have been predicted. 
As a measure of their intellectual achievement we have their “credit 
points” or “honor points” obtained at the end of the first semester. 
The grade of A, which means very superior work, carries 3 honor points 
per semester hour; the grade of B gives 2 honor points; C gives 1 
honor point per semester hour and is regarded as an average grade; 
D is passing but gives no honor points. This system is, like all 
others, entirely arbitrary but practical. It enables us to use the 
student’s total number of honor points as a measure of his intellectual 
achievement. Since the normal load for a freshman is 16 semester 
hours, the maximum number of honor points obtainable is 48. The 
honor points of the group under consideration were distributed as 
follows: 


Honor points    0   3   6   9  12  15  18  21  24  27  30  33  36  39  42  45  48 
Frequency      30  32  40  41  47  47  49  46  35  25  24  16  11  10   7   4   3 
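
The honor point system described above can be sketched in a few lines of Python. The point values per semester hour are those stated in the text; the function name and the sample schedule are hypothetical and serve only to show the arithmetic.

```python
# Honor points per semester hour, as stated in the text.
HONOR_POINTS_PER_HOUR = {"A": 3, "B": 2, "C": 1, "D": 0}

def honor_points(courses):
    """Sum honor points over (grade, semester_hours) pairs."""
    return sum(HONOR_POINTS_PER_HOUR[grade] * hours for grade, hours in courses)

# Hypothetical 16-hour freshman schedule (not taken from the study).
schedule = [("A", 3), ("B", 5), ("C", 5), ("D", 3)]
print(honor_points(schedule))        # 3*3 + 2*5 + 1*5 + 0*3 = 24
print(honor_points([("A", 16)]))     # all A's on a 16-hour load: 48
```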


Proceeding on the assumption that honor points measure intellec- 
tual achievement, our problem is to predict the honor points any 
student will make in a given time. Some of the factors on which 
intellectual achievement depends are: (1) General intelligence as 
measured by some standardized intelligence test; (2) preparatory 
school work as measured by the quantity of work (units) and by the 
quality of work (grades); (3) the industry or application of the student 
to his intellectual tasks measured roughly by the number of hours per 
week spent in study; (4) the mental efficiency of the student, or knowl- 
edge of how to use his mind, which factor we cannot measure; (5) 
interest in work, or strength of incentives to learn, or motives for 
being in college, another factor which defies measurement; (6) certain 
traits of character and personality factors; and finally (7) health and 
physical and social environment. This is by no means a complete 
list of the factors on which academic success depends. There are 
doubtless many other subtle elements that play a part. It remains for 
scientific investigations to isolate these and measure them. 

The general intelligence of the experimental group under considera- 
tion was measured by a combination of the Miller Mental Ability 
Test and the Dartmouth Completion of Definitions Test. The Miller 
Test contains 120 elements and the Dartmouth test contains 40; 
thus the maximum combined raw score is 160. The reason for 
combining the raw scores is that this turned out to be about the optimum 
combination, the one which gives the maximum correlation with honor 
points. The distribution of the combined scores is given below: 

Score       Frequency 
140-             1 
130-139         13 
120-129         27 
110-119         87 
100-109        103 
90-99          118 
80-89           45 
70-79           25 
60-69           15 
50-59            6 














The correlation between general intelligence as measured by these 
tests and honor points obtained at the end of the first semester is 
+.60. Since this correlation is somewhat higher than those usually 
obtained between intelligence and college marks it seems advisable 
to present here the correlation table. 


Honor Points 







































































0 | 3 | 6 | 9/12/15/18|21/24'27/30'33 1285 
| | | 
140- yan de he hsa] owbacBea baba 2 
130- bah«s J--|--| 1) 1] 2) 2)..) 1). 3} 10~- 
120- ssjesfee|ee}eo] 3} 1) 4) 3) 2) 2 3} 3} 2|..| 25 
110- a 2) 2/ 4/10)11) 8) 9) 7/11) 7) 9 4) 1] 2)..| 90 
100- 5 | 4) 6| 9! 9) 9120/13! 9 4/10) 1 . ./102 
90- 8 |12|15} 9/13/14) 9/13/12) 6 5)..).. ../116 
80- 8 | 6| 6] 5| 4! 7 3] 3) 3). ik. |..| 45 
70- 2 | 4) 4} 3) 3) 5) 2) 2)..\.. oben «| 28 
60- 3 | 3| 4) 4) 11... safe ee e]e ole 15 
50-59 3 5 ae ae |p v|e 6 
r = +.60 


One interesting thing about this table is that individuals who score 
above the mean in intelligence (about 100 points) are more likely to 
make less than the mean number of honor points (about 18) than those 
who score less than the mean in intelligence are to make more than the 
mean in honor points. That is, the correlation is higher in the upper 
ranges than in the lower. The reason is probably that some individuals 
may have a high intelligence rating and still do poor work in college 
because of lack of application, or poor methods of study, or something 
else of this nature, while on the other hand a student who is low in 
intelligence can make high marks only by good methods and great 
industry. Thus the tendency exhibited above may very well be due, 
in part at least, to the “tendency to least effort” so common among 
many college students, even the brighter ones. Indeed, we have data 
to show that those who had intelligence scores above the mean and who 
received less than the mean number of honor points actually studied 
from 4 to 10 hours a week less on the average than those of the same 
level of intelligence who received more than the mean number of honor 
points. All of this suggests that industry or application to work is an 
important factor. Presently we shall show that the partial correla- 
tion between intelligence scores and honor points with hours per week 
of study constant is +.805. 

As a measure of high school preparation, the number of units offered 
and the average grade obtained were used. These are no doubt 
inadequate measures but are the best we have. The correlation 
between units offered for entrance and honor points obtained at the 
end of the first semester is +.22. One reason for this low correlation 
is the skewed distribution of units. This distribution is given below. 


Units 14 15 16 17 18 19 20 21 22 23 
Frequency 40 173 117 58 39 19 6 4 1 1 


The quality of the high school preparation was measured in terms 
of the average grade obtained in the work offered for entrance. If 
the student offered in addition his Regents Examination marks in 
the State of New York, or if he offered College Entrance Board 
Examination marks, these were used in every case as being more 
reliable than the high school averages. The correlation between 
these measures and honor points is +.405. 

Assuming that units is a measure of the amount, or quantity, 
of preparatory work; and that average grades is a measure of the 
quality of this work; and assuming further that quantity times quality 
is a better measure of work than either quantity or quality taken alone, 
we would expect to get a higher correlation from a combination of 
these two. When we multiply them, that is when we weight the units 
by the grades obtained in them, the correlation of this product with 
honor points is +.24. This is little better than units alone and very 
much worse than grades alone. A linear combination in which “best” 
weights are assigned yields a maximum 
correlation with honor points of +.406, which is no better than average 
grades taken alone. 
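
The maximum correlation obtainable from a weighted sum of two predictors can be checked with the standard two-predictor formula for multiple correlation. The sketch below assumes that the “best” weights are least-squares weights, which the text does not state explicitly; the function name is ours. The inputs are the correlations reported in this paper: +.405 (grades with honor points), +.22 (units with honor points) and +.40 (grades with units).

```python
import math

def multiple_r_two(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    r_sq = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_sq)

# Grades and units as joint predictors of honor points.
big_r = multiple_r_two(0.405, 0.22, 0.40)
print(round(big_r, 3))
```

The result falls within rounding distance of the +.406 reported above, and barely exceeds the correlation for grades alone, because units add almost nothing once grades are known.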

Application to work, or general industry, was measured by 
the average number of hours per week each student spent in study. The 
value of such a measure obviously depends on its reliability. The 
question here is how much dependence can be put in a student’s state- 
ment of the average number of hours per week he spends in study. 
The information was received in two ways. Early in the semester 
each student was asked to fill in a card telling how he spent his time 
each week. There was a blank space for “hours spent in sleeping,” 
“hours spent at meals,” etc. Among the many items was “hours 
spent in actual preparation of assigned work.” An effort was made 
to convey the impression that we were interested primarily in how 
they spent their total time rather than in the time spent on any one 
item. Thus the temptation to over-state the number of hours spent at 
study was partially overcome. Again after the middle of the semester 
each student was asked to fill in another card which called for certain 
information concerning each course. One item of information was 
the average number of hours per week spent studying each subject. 
These results were totaled and each total compared with the previous 
statement. The correlation between the two statements was surpris- 
ingly high, +.86. It could hardly be expected to be perfect since 
some students were studying more and some less at the middle of the 
semester than at the beginning. The statement used in this study was 
the one given at the middle of the semester since it was felt that the 
average number of hours per week spent in study for the middle weeks 
of the semester was more representative of the students’ industry than 
that of the first weeks. The correlation between this measure and 
honor points is +.32. 


So far we have made no attempt to measure such factors as mental 
efficiency, character traits, personality, health, and environmental 
influences all of which play a prominent part in academic work. Just 
how much each of these factors contributes to success in college remains 
to be determined. Our immediate problem is to determine what pre- 
dictions can be made from the factors which we have roughly measured. 

Ultimately the reliability of prediction depends on the correlation 
between the instrument of prediction and the thing to be predicted. 
In case there is more than one agency of prediction the intercorrelations 
must also be known. Accordingly the table of intercorrelations is 
presented below. 

                              Honor    Intelli-   High school     Units   Hours per 
                              points   gence      average grade           week of study 
Honor points                   1.00      .60         .40           .22        .32 
Intelligence                    .60     1.00         .36           .20       −.35 
High school average grades      .40      .36        1.00           .40        .11 
Units                           .22      .20         .40          1.00        .25 
Hours per week of study         .32     −.35         .11           .25       1.00 

















Since predictions require also means and standard deviations we 
present a table of them. 








                                  Mean     Standard deviation (σ) 
Honor points                      18.5          11.2 
Intelligence                     100.6          15.8 
High school average grades        79             7.5 
Units                             16.1           1.5 
Hours per week of study           24             6.0 











With these data we are ready to see how well we can predict 
intellectual achievement from the above mentioned factors. The 
formula for predicting X from Y is: 


$$X = r_{xy}\,\frac{\sigma_x}{\sigma_y}\,(Y - M_y) + M_x$$


The standard deviation of the errors of prediction is given by 


$$\sigma_e = \sigma_x\sqrt{1 - r_{xy}^2}$$


The formulas for predicting honor points from the factors given 
above when taken one at a time are: 


HONOR POINTS                        σ OF ERRORS OF PREDICTION 
.42 Intelligence − 25                        8.9 
.59 High school average − 28                 . . . 
1.61 Units − 8                               . . . 
.58 Hours per week + 4                       . . . 




For example, if a student has a score of 140 in intelligence his most 
likely number of honor points as predicted from his score will be 
.42 × 140 − 25 = 34. The 8.9 tells us that the chances are about 
2 to 1 that his honor points will be between about 25 and 43; or in other 
words, it simply tells us that of all those who score 140 in intelligence 
the mean number of honor points is 34 and the standard deviation of 
their distribution of honor points is 8.9. Such predictions are not worth 
very much for individual cases but are better than nothing. 
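
The single-factor prediction can be reproduced from the correlation, means and standard deviations reported in this paper. The Python sketch below is illustrative; the function and variable names are ours. Because it works from unrounded constants, its figures differ slightly from the rounded ones printed above.

```python
import math

def simple_regression(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept, error_sd) for predicting X from Y.

    slope     = r * sd_x / sd_y
    intercept = mean_x - slope * mean_y
    error_sd  = sd_x * sqrt(1 - r^2)
    """
    slope = r * sd_x / sd_y
    intercept = mean_x - slope * mean_y
    error_sd = sd_x * math.sqrt(1 - r * r)
    return slope, intercept, error_sd

# Honor points (X) predicted from intelligence (Y), using the reported
# r = .60, means 18.5 and 100.6, standard deviations 11.2 and 15.8.
slope, intercept, err = simple_regression(0.60, 18.5, 11.2, 100.6, 15.8)
predicted = slope * 140 + intercept   # student scoring 140 in intelligence

print(f"slope {slope:.3f}, intercept {intercept:.1f}, error SD {err:.2f}")
print(f"predicted honor points for a score of 140: {predicted:.1f}")
```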

When these factors are combined into a single equation the relia- 
bility of the prediction is increased. The general form of such an 
equation is stated at the beginning of this paper. The problem, of 
course, is to find the best system of weights to attach to the factors so 
as to get the most reliable instrument of prediction. The statistical 
labor involved in doing this is great especially when there are more than 
three factors. As the number of factors increases the work multiplies 
enormously. Several statistical short cuts have been proposed. The 
method of correlation determinants was used here.¹ 

When we put all of the factors which we have measured into a 
single equation and give to each its “best” weight so that the composite 
will yield the maximum correlation with honor points, we 
obtain the following: 

Honor points = .58 intelligence + .14 average grades − 1.03 units + 1.10 time − 62   (1) 

Thus if a student has an intelligence score of 120, and a high school 
average of 75, and offers 15 units for entrance, and studies on the 
average 30 hours a week, his most probable number of honor points is: 


(.58 × 120) + (.14 × 75) − (1.03 × 15) + (1.10 × 30) − 62 = 36 
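
The substitution above is easily mechanized. The following Python sketch simply evaluates equation (1) with the weights as printed; the function name is hypothetical.

```python
def predict_eq1(intelligence, avg_grade, units, hours):
    """Equation (1): predicted honor points from all four measured factors."""
    return (0.58 * intelligence + 0.14 * avg_grade
            - 1.03 * units + 1.10 * hours - 62)

# The worked case from the text: score 120, average 75, 15 units, 30 hours.
value = predict_eq1(120, 75, 15, 30)
print(round(value, 2), "->", round(value))
```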


The standard deviation of the errors of prediction is given by the 
formula $\sigma_s\sqrt{1 - R^2}$, in which R is the correlation 
between the right and left hand sides of the above equation or, what 
amounts to the same thing, the correlation between the actual number 
of honor points obtained and the number predicted by this combination 
of factors. R is obtained directly from the formula 


$$R = \sqrt{1 - \frac{D}{D_{11}}}$$


in which $D$ is the correlation determinant and $D_{11}$ is its first minor. 


¹ Kelley, T.: “Statistical Method,” Chap. XI. Also “Memoirs of the National 
Academy of Sciences,” Vol. XV, Part 3, Chap. II. 



















In this case R = +.84. The standard deviation of the errors of 
prediction is 6.0 and the probable error of prediction is .6745 times 
the standard deviation, or 4.04. Thus, in the above case where 
the most probable achievement in honor points is 36, the chances 
are even that they will not be less than 32 or more than 40. 
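
The step from standard deviation to probable error is a single multiplication, sketched here in Python with the figures from the paragraph above (function name ours).

```python
def probable_error(sd: float) -> float:
    """Probable error: the deviation exceeded by half the cases, .6745 * SD."""
    return 0.6745 * sd

pe = probable_error(6.0)
low, high = 36 - pe, 36 + pe   # even-chance limits around the prediction of 36
print(f"PE = {pe:.3f}; even chances between {low:.1f} and {high:.1f}")
```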

It is obvious that the reliability of any agency of prediction is 
measured by the size of the errors; and the errors depend on the degree 
of correlation between the agency and the thing predicted. Just as 
the reliability of an intelligence examination depends on the correlation 
between the aggregate of tests and some “criterion” of intelligence 
and not on the number of tests or the length of them; just so do pre- 
dictions of this sort depend on the correlation and not on the number 
of factors entering into the composite. Accordingly we find that we 
can eliminate the factor of units from our equation and still get a 
correlation of .838, whereas leaving units in we only get .84. And 
furthermore the PE of the errors of prediction rises very slightly. 
But when we eliminate units we have a new equation with different 
constants: 


Honor points = .55 intelligence + .083 average grades + 1.06 time − 70   (2) 


The low weight attached to High School averages indicates that this 
factor, too, might be eliminated without doing serious damage to our 
instrument of prediction. When we do this the equation becomes: 


Honor points = .62 intelligence + 1.2 time − 70   (3) 
R = .825, σ_e = 6.3, and PE = 4.19 
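
The multiple correlation for equation (3) can be recovered from the reported intercorrelations by the determinant method, R² = 1 − D/D₁₁, where D is the determinant of the full correlation matrix (criterion first) and D₁₁ the minor left after striking out the criterion row and column. The Python sketch below is a minimal illustration with helper names of our own choosing; it uses plain cofactor expansion, which is adequate only for small matrices.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def multiple_r(corr):
    """Multiple R of variable 0 with the rest: sqrt(1 - D / D11)."""
    d = det(corr)
    d11 = det([row[1:] for row in corr[1:]])
    return math.sqrt(1 - d / d11)

# Order: honor points, intelligence, hours per week of study.
corr = [[1.00, 0.60, 0.32],
        [0.60, 1.00, -0.35],
        [0.32, -0.35, 1.00]]

big_r = multiple_r(corr)
err_sd = 11.2 * math.sqrt(1 - big_r**2)   # 11.2 = SD of honor points
print(f"R = {big_r:.3f}, error SD = {err_sd:.2f}")
```

With the rounded correlations of the table the result agrees with the R and the standard deviation of errors printed for equation (3) to within rounding.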


Thus by knowing the intelligence of a student and knowing the 
time he spends at study we can predict his honor points with but 
slightly greater error than we could if we knew also the units he offered 
and the average grade he made in them. This all comes about because 
of the negative correlation between time spent at study and scores in 
intelligence tests. It is this same negative correlation that causes 
grades to have such a low weight in equation (2) and units to have a 
negative sign in equation (1). Indeed, this negative correlation is one 
of the most significant facts of the whole study. 

Unfortunately, the factor of application or industry as measured 
by time spent in study is of no use in predicting the success of a 
candidate for entrance to college simply because we have no measure of 
it until the student has been in college for at least a few weeks. Leaving 
it out of the equation we have: 


Honor points = .37 intelligence + .29 average grades + .27 units − 47   (4) 
R = .64 and σ_e = 8.6 

It is interesting to note here that intelligence alone is almost as 
good a means of prediction as the three factors in equation (4) for 
intelligence alone correlates .60 with honor points and the standard 
deviation of the errors of prediction is but slightly higher, 8.9. 

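
The weights of equation (4) can be approximated from the reported correlations, means and standard deviations by solving the usual normal equations. The Python sketch below assumes that the “best” weights are least-squares weights, which the text implies but does not state; the small solver and all names are ours.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(n):
            if k != col:
                f = m[k][col] / m[col][col]
                m[k] = [x - f * y for x, y in zip(m[k], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Predictors in order: intelligence, high school average, units.
r_xx = [[1.00, 0.36, 0.20],
        [0.36, 1.00, 0.40],
        [0.20, 0.40, 1.00]]
r_xy = [0.60, 0.40, 0.22]            # correlations with honor points
means, sds = [100.6, 79.0, 16.1], [15.8, 7.5, 1.5]
mean_y, sd_y = 18.5, 11.2            # honor points

betas = solve(r_xx, r_xy)                            # standardized weights
weights = [b * sd_y / s for b, s in zip(betas, sds)] # raw-score weights
constant = mean_y - sum(w * mu for w, mu in zip(weights, means))
multiple_r = sum(b * r for b, r in zip(betas, r_xy)) ** 0.5

print([round(w, 3) for w in weights], round(constant, 1), round(multiple_r, 3))
```

Computed from the rounded table values, the weights, the constant and R come out close to those printed in equation (4); small differences are to be expected from rounding.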
So far everything points to the conclusion that intelligence and 
industry are the two most important factors in academic success in 
comparison with which high school preparation does not count for 
much. This hypothesis can be tested further by the partial correla- 
tion technique. The partial correlation between honor points and 
units with intelligence constant is +.127; with average grades constant, 
is +.071, and with time spent in study constant is +.043; with intelli- 
gence, high school averages, and time spent in study all three constant 
is −.218. All of this means that there is no causal relationship between 
the number of units a student offers for entrance and the number 
of honor points he will obtain in the first semester, so far as our data 
are reliable. It does not mean, on the other hand, that a student with 
zero units is just as likely to succeed as a student with an infinite 
number, simply because the scale of units with which we are dealing 
runs from 14 to 23 and not from zero to infinity. To state the case 
precisely we should say that a student offering 14 or 15 units has the 
same chance of success as a student offering 20 or more units, other 
things being equal. The “other things”’ are intelligence, industry, etc. 
However, passing now from facts to speculations the writer is of the 
opinion that even if the units distribution ranged from say 5 to 50 units 
the correlation with honor points would not rise above .35 or .40 and 
that the same low partial correlations would obtain when intelligence 
and industry are kept constant. 


Aside from statistical treatment there are other considerations that 
lead us to believe that the amount of preparatory school or high school 
work is a very poor index to academic success. High school graduates 
who come to college with three years of foreign language are in many 
instances unable to enter the fourth-year course in that language with 
a college-trained group but are forced to drop back and take the 
“college” third year which they tell us is usually equivalent to the high 
school fourth year. This seems to be true not only in languages but also 
in sciences and other subjects. For example, students who have had a 
year of physics in high school are seldom able to take advanced courses 
in college without first taking the elementary course. Indeed, there 
are now and then cases in which the student seems to be worse off by 
having taken the high school course at all. 

To go back again to the statistical data the equation for predicting 
honor points from units alone is: 


Honor points = 1.61 units − 8 
R = .22, σ_e = 10.6, and PE = 7.15 


According to this a student may offer as few as 10 units and still have 
an even chance of making enough honor points to stay in college (9). 
By plotting the percentage of students who fail at the different levels 
of units and extending the line on down by extrapolation we find that 
around 8 units about 50 per cent would pass and 50 per cent fail. On 
the other hand, according to the above formula in order for a student 
to be reasonably certain of success, he would have to offer at least 30 
units. When we substitute 15 units in the above equation we find that 
the most probable number of honor points is 16 plus or minus 7. It so 
happens that 16 is just enough honor points to keep a member of this 
group off probation. 

This suggests the question of the basis of the traditional standard 
of 15 units for admission to college. Why 15 rather than 10, or 20, 
or some other number? The answer seems to be that after many years 
of experience in several institutions 15 units seems to be about right. 
Furthermore, four years of high school work at the rate of five subjects 
per year yields 20 units of work, but the presumption is that on the 
average 5 of these units will be the wrong kind. It would be an 
interesting scientific study to determine the minimum number of units 
a student could offer and still have a fair chance of success. The data 
here seem to indicate that we could afford to drop the units requirement 
to say 10 units, and then rigidly apply other means of selection. 

The factor of high school averages turns out to be somewhat more 
significant than units when tested by the partial correlation method. 
The partial correlation between high school grades and honor points 
with intelligence constant is +.246; with time spent at study constant 
is +.388; with units constant is +.348; with both time and intelli- 
gence constant is +.318. This would indicate that there is some 
causal connection between high school averages and academic success. 
Hence we may assume, until we have evidence to the contrary, that the 
quality of high school work as measured by grades, is a factor of success 
in college, although a relatively minor factor. 













That general intelligence is the most important single factor in 
academic success there is no doubt. When we apply the partial 
correlation method of analysis the results are surprising. The partial 
correlation between intelligence and honor points with time spent in 
study constant is +.805; with high school averages constant, +.532; 
with units constant, +.59. The high partial correlation of +.805 is due 
to the negative correlation of −.35 between intelligence and time spent 
in study. The brighter they are the less they study. Thus some 
bright students will make low grades and some dull students will make 
high grades by sheer industry. If application to work were proportional 
to ability, then all correlations between college grades and intelligence 
scores would rise much higher. 
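
The first-order partial correlations quoted above follow from the standard formula, sketched here in Python with the intercorrelations reported in this paper. The function name is ours; since the inputs are the rounded table values, the results agree with the printed figures only to about the second decimal.

```python
import math

def partial_r(r_ab: float, r_ac: float, r_bc: float) -> float:
    """Correlation of a with b when c is held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# a = honor points, b = intelligence.
with_time_constant = partial_r(0.60, 0.32, -0.35)    # c = hours of study
with_grades_constant = partial_r(0.60, 0.40, 0.36)   # c = high school average

print(round(with_time_constant, 3), round(with_grades_constant, 3))
```

The sketch makes the mechanism visible: the negative correlation between intelligence and study time enlarges the numerator, and so lifts the partial correlation well above the raw +.60.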

The general conclusion from all this is that the most reliable means 
of predicting academic success is a combination of intelligence and 
degree of application. Since we cannot know the factor of application 
in advance of the student’s admission to college, it cannot be used for 
the purposes of prediction except in a limited way. If a student has an 
intelligence rating of X and wants to know how many honor points he 
will receive if he studies Y hours per week, we can tell him by making 
the proper substitutions in equation (3). 

If on the other hand, the student wants to know how much time he 
must spend in study in order to receive a certain number of honor 
points we must use the other regression: 


Time = .45 honor points − .326 intelligence + 50 


The PE of the errors of prediction is 2.7 hours per week. Suppose 
an individual has an intelligence rating of 90 and wants to make 16 
honor points; he must study on the average 28 hours a week, plus 
or minus 2.7. This seems to be the most practical and most reliable 
kind of prediction that we can make. For easy reference and everyday 
use we can construct a table from the above equation which will 
aid in giving students advice as to how much studying they should do 
in order to receive a given number of honor points. Of course, this 
sort of thing has limits and will work only in a restricted range. 

Table for predicting the number of hours per week a freshman must 
study during the first semester in order to obtain a given number of 
honor points when his intelligence score on the combination of tests here 
used is known. The three upper rows of this table are given here merely 
to illustrate the method. 










                                   Honor Points 

Intelligence    15    18    21    24    27    30    33    36    39    42    45    48 

140            11.1  12.5  13.8  15.0  16.5  17.8  19.2  20.5  21.9  23.2  24.6  25.9 
135            12.7  14.1  15.4  16.8  18.1  19.5  20.8  22.2  23.5  24.9  26.6  27.6 
130            14.4  15.7  17.1  18.4  19.8  21.1  22.4  23.8  25.2  26.5  27.8  29.2 









































If, for example, a student has an intelligence score of 140 and wants 
to make 15 honor points he must study on the average 11 hours per 
week, plus or minus 3. 
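
Rows of such a table can be generated directly from the regression of time on honor points and intelligence. The Python sketch below uses the weights of that equation as printed; the function name and the output layout are ours.

```python
def hours_needed(honor_points: float, intelligence: float) -> float:
    """Time = .45 honor points - .326 intelligence + 50 (hours per week)."""
    return 0.45 * honor_points - 0.326 * intelligence + 50

honor_levels = range(15, 49, 3)          # 15, 18, ..., 48
print("Score " + " ".join(f"{hp:5d}" for hp in honor_levels))
for score in (140, 135, 130):
    row = [hours_needed(hp, score) for hp in honor_levels]
    print(f"{score:5d} " + " ".join(f"{h:5.1f}" for h in row))
```

As the text cautions, the equation is linear and should be trusted only within the range of scores and honor points actually observed.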

This sort of prediction is valuable for students who have been 
admitted to college. It has little value as a means of selection because, 
as stated above, the factor of application is not known. 

Looking to the future, we may profitably inquire as to what should
be known in order to predict academic success with a reasonable degree
of certainty. By “reasonable” we mean that the PE of prediction
should not be more than 3 honor points. If the distribution of honor
points were normal, the standard deviation would be about one-sixth of
the range, or 8 honor points. As a matter of fact, nearly all distributions
of academic success are more or less skewed, which means that
the standard deviation is more than one-sixth of the range. Thus, with
a standard deviation of 8 to 10 honor points, the correlation with the
agencies of prediction must be at least .90 in order that the standard
deviation of the errors of prediction be reduced to 3 honor points. In
the last analysis we are seeking a combination of traits and elements
which will correlate as much as .90 with academic success. Such a
correlation will probably not be obtained until we can measure some of
the more or less intangible traits of character and personality.
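
The arithmetic behind the figure of .90 may be checked with the usual formula for the standard error of estimate, sd multiplied by the square root of (1 - r squared), together with the relation PE = 0.6745 times the standard deviation. Neither formula is stated in this passage, so their use here is an assumption about the reasoning intended, and the names PE_FACTOR and required_r are illustrative only.

```python
import math

# Assumed relation for a normal distribution: PE = 0.6745 * standard deviation.
PE_FACTOR = 0.6745


def required_r(sd, max_error_sd):
    """Smallest correlation r for which sd * sqrt(1 - r**2) <= max_error_sd."""
    ratio = max_error_sd / sd
    if ratio >= 1:
        return 0.0  # the unaided standard deviation already meets the target
    return math.sqrt(1 - ratio ** 2)


for sd in (8, 10):
    as_sd = required_r(sd, 3)              # the 3 points read as a standard deviation
    as_pe = required_r(sd, 3 / PE_FACTOR)  # the 3 points read as a probable error
    print(sd, round(as_sd, 3), round(as_pe, 3))
```

Under these assumptions a probable error of 3 honor points with a standard deviation of 10 calls for a correlation of about .90, while reading the 3 points as a standard deviation would call for a somewhat higher correlation.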













A COMPARISON OF IQ’s OBTAINED WITH DEARBORN 
GROUP TESTS AND THE STANFORD REVISION 


FRANK S. FREEMAN
Supt. of Education, State Infirmary, Tewksbury, Mass. 


The accompanying table shows the intelligence quotients obtained 
by the use of the Dearborn Group Tests and the Stanford Revision of 
the Binet on the same group of children. The mental ages are not 
given here because they would contribute little, inasmuch as the tests 
were given at different times, the interval varying from one year to one and
one-half years.

It is interesting to note that of the 75 cases here quoted 46 show 
the IQ’s of the Dearborn tests to be higher than those of the Binet by 
from one to ten points, excluding a few exceptional cases which merit 
special consideration. There are five cases which show no difference 
at all, and 24 which give a smaller Dearborn IQ. 

Even the same test given by two different examiners will generally 
show a small difference in the IQ because of the personal equation of 
the examiner, the attitude and the general condition of the child on 
the different days. This factor will account for the differences of 
most of the cases. The preponderance of larger Dearborn IQ’s, as 
well as the fact that it has been observed that group test IQ’s are
generally slightly higher than those of individual examinations, is
believed to be due to the greater language element in the latter type.

Now cases 70, 71, 72, 73, 74, and 75 show fairly large differences 
and need explanation, although there are only two (Nos. 70 and 74) 
which show such variations that the Dearborn IQ’s place the girls in
one category and the Binet IQ’s place them in an entirely different 
one. Number 70 is that of a girl whose hearing is defective, and 
although the Binet examiner was quite likely aware of this fact it is 
not felt that the handicap was entirely overcome. In giving the 
Dearborn test the girl’s teacher repeated the questions for her par- 
ticular benefit and in the manner to which she was accustomed to 
hearing her directions. Furthermore this pupil’s school work bears 
out the mental age of 6-9 obtained with the group tests. 

Numbers 71, 72, 73, and 75 show variations which can be explained 
by the language factor. These four girls had just begun their schooling 
at the time the Binet tests were given, and, therefore, had not the 
language ability which is developed through schooling and which is a 
rather essential factor in the Binet. It is to be noted, however, that in
none of these cases does the IQ of either test place the pupil in a category
different from that designated by the other.

Number 74 shows in the Dearborn tests a mental age of 11-6 with
an IQ of 82; in the Binet the same girl shows a mental age of 9-2 and
an IQ of 57, a considerable discrepancy. This pupil was doing quite
satisfactory Grade V work, so that it is felt the mental age of 11-6
obtained with the group tests is more nearly correct than that of 9-2.
This large difference is probably due to the new surroundings under
which the Binet was given, a strange examiner, and an unwillingness
to extend the maximum of cooperation during the examination. It
is hardly likely that there has been a special deterioration in the last
year because the girl is still doing essentially the same grade of work,
both classroom and manual.

No general conclusions can be based upon these 75 comparisons,
but they do offer a serious case for the advisability of using group
tests even among low grade mentalities, as these obviously are.
Group tests among this type of child have their advantages, among
them being the vastly reduced nervousness and the greater ease with
which the child works when not under such very close scrutiny as in
individual tests. This, it has been observed, is a factor of considerable
importance.

It has been said by some that for low mentalities group tests are
not satisfactory, but after examining 250 mentally defective children
with such tests, and finding a high correlation with actual school
ability, it appears that they may well be used among sub-normals as
well as among normals.

Not only does this comparison indicate a high correlation of the 
Dearborn group tests with the individual test, but it contributes 
towards the belief in the constancy of IQ’s among low grade mentali- 
ties, the average difference here being for all cases 4.6 points; for the 
positive 5.6 points, and for the negative 3.7 points. 










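                                        TABLE 1

Case No. | Dearborn IQ | Binet IQ | Difference || Case No. | Dearborn IQ | Binet IQ | Difference
    1    |      31     |    34    |     -3     ||    39    |      61     |    53    |      8
    2    |      52     |    48    |      4     ||    40    |      85     |    80    |      5
    3    |      54     |    46    |      8     ||    41    |      72     |    67    |      5
    4    |      62     |    66    |     -4     ||    42    |      72     |    75    |     -3
    5    |      50     |    53    |     -3     ||    43    |      64     |    58    |      6
    6    |      55     |    52    |      3     ||    44    |      81     |    76    |      5
    7    |      48     |    44    |      4     ||    45    |      61     |    59    |      2
    8    |      53     |    49    |      4     ||    46    |      78     |    76    |      2
    9    |      47     |    45    |      2     ||    47    |      70     |    71    |     -1
   10    |      57     |    65    |     -8     ||    48    |      82     |    88    |     -6
   11    |      48     |    50    |     -2     ||    49    |      74     |    85    |    -11
   12    |      67     |    63    |      4     ||    50    |      65     |    68    |     -3
   13    |      79     |    80    |     -1     ||    51    |      59     |    53    |      6
   14    |      59     |    54    |      5     ||    52    |      70     |    66    |      4
   15    |      72     |    67    |      5     ||    53    |      64     |    66    |     -2
   16    |      60     |    58    |      2     ||    54    |      54     |    51    |      3
   17    |      62     |    56    |      6     ||    55    |      50     |    50    |      0
   18    |      75     |    70    |      5     ||    56    |      61     |    60    |      1
   19    |      57     |    58    |     -1     ||    57    |      57     |    52    |      5
   20    |      61     |    55    |      6     ||    58    |      67     |    67    |      0
   21    |      52     |    51    |      1     ||    59    |      67     |    61    |      6
   22    |      68     |    68    |      0     ||    60    |      55     |    50    |      5
   23    |      91     |    87    |      4     ||    61    |      71     |    71    |      0
   24    |      81     |    77    |      4     ||    62    |      70     |    66    |      4
   25    |      74     |    70    |      4     ||    63    |      57     |    62    |     -5
   26    |      72     |    75    |     -3     ||    64    |      42     |    44    |     -2
   27    |      76     |    81    |     -5     ||    65    |      47     |    47    |      0
   28    |      60     |    58    |      2     ||    66    |      71     |    65    |      6
   29    |      77     |    74    |      3     ||    67    |      51     |    52    |     -1
   30    |      75     |    77    |     -2     ||    68    |      54     |    52    |      2
   31    |      70     |    63    |      7     ||    69    |      39     |    46    |     -7
   32    |      60     |    54    |      6     ||    70    |      62     |    45    |     17
   33    |      70     |    72    |     -2     ||    71    |      50     |    40    |     10
   34    |      67     |    68    |     -1     ||    72    |      57     |    47    |     10
   35    |      70     |    61    |      9     ||    73    |      63     |    51    |     12
   36    |      46     |    50    |     -4     ||    74    |      82     |    57    |     25
   37    |      70     |    67    |      3     ||    75    |      51     |    42    |     10
   38    |      45     |    53    |     -8     ||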
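
The counts and average differences reported in the text may be recomputed from Table 1. The following Python sketch is offered only as a check: the variable names are illustrative, and every difference is recomputed from the two IQ's rather than taken from the printed Difference column. As the figures stand in the table, the IQ's of case 75 differ by 9, although the difference is printed as 10.

```python
# (Dearborn IQ, Binet IQ) for cases 1 to 75, in order, as printed in Table 1.
PAIRS = [
    (31, 34), (52, 48), (54, 46), (62, 66), (50, 53), (55, 52), (48, 44), (53, 49),
    (47, 45), (57, 65), (48, 50), (67, 63), (79, 80), (59, 54), (72, 67), (60, 58),
    (62, 56), (75, 70), (57, 58), (61, 55), (52, 51), (68, 68), (91, 87), (81, 77),
    (74, 70), (72, 75), (76, 81), (60, 58), (77, 74), (75, 77), (70, 63), (60, 54),
    (70, 72), (67, 68), (70, 61), (46, 50), (70, 67), (45, 53), (61, 53), (85, 80),
    (72, 67), (72, 75), (64, 58), (81, 76), (61, 59), (78, 76), (70, 71), (82, 88),
    (74, 85), (65, 68), (59, 53), (70, 66), (64, 66), (54, 51), (50, 50), (61, 60),
    (57, 52), (67, 67), (67, 61), (55, 50), (71, 71), (70, 66), (57, 62), (42, 44),
    (47, 47), (71, 65), (51, 52), (54, 52), (39, 46), (62, 45), (50, 40), (57, 47),
    (63, 51), (82, 57), (51, 42),
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Difference = Dearborn IQ minus Binet IQ, recomputed for every case.
diffs = [dearborn - binet for dearborn, binet in PAIRS]
higher = [d for d in diffs if d > 0]   # Dearborn IQ the larger
lower = [-d for d in diffs if d < 0]   # Dearborn IQ the smaller (sizes only)
same = diffs.count(0)                  # no difference

print("higher, same, lower:", len(higher), same, len(lower))
print("average difference, all cases:", round(mean([abs(d) for d in diffs]), 1))
print("average positive difference:", round(mean(higher), 1))
print("average negative difference:", round(mean(lower), 1))
```

Recomputed in this way, the table yields the 46, 5 and 24 cases and the averages of 4.6, 5.6 and 3.7 points given in the text.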


































NEW PUBLICATIONS IN EDUCATIONAL
PSYCHOLOGY AND RELATED FIELDS OF
EDUCATION











DEPARTMENT IN CHARGE OF LAURA ZIRBES¹


1. Mechanical Ability Measured.—Data and measuring instruments 
made accessible by such advances into the realm of mental measurement
as the new Stenquist tests² may lead us to reconsider and revise
current definitions of general intelligence and recognize the specific
nature of mental endowment as clearly as we have come to recognize
the need for specific training. Extended experimentation has
resulted in three series of assembling tests involving the use of numerous
actual models, and two tests of mechanical aptitude in which
pictures are used. All the tests are designed for use with groups or
classes. The McCall method was used in scaling and eliminating
elements. The test results were correlated not only with other criteria
of mechanical ability but also with the pooled results of six group
intelligence tests. While the data are amply indicative of the significance
and validity of the new measures, the correlations with so-called
“general intelligence” are so low that they exhibit the inadequacy and
injustice of mental classifications based on the sole use of instruments
which disregard this field of mental activity.

In the Stenquist tests the difference between the median score
(or norm) of adult (army) men and that of fifteen-year-old boys is
greater than the difference between the medians (or norms) of adjacent age
groups between twelve and fourteen (boys). A few partial records
for girls and adult women show further differences which make one
wonder to what extent differences in opportunity, incidental training
or restricted experience are reflected in the performance of individuals
of either sex. To reduce the effects of such variable factors, if for no
other reason, would it not be feasible to devise and standardize a
battery of tests worked out on the basis of a definition of intelligence as
aptitude in learning, or relative ability to profit by controlled learning
experiences? A study of the variations in the differences between
“before and after” measurements or of variations in the amount of
controlled practice required to reach a given degree of proficiency might
furnish a significant criterion or index of innate capacity or aptitude.
Meanwhile, a realization that the scores reflect training as well
as capacity should temper criticism and influence interpretations.

1 All unsigned reviews were prepared by Laura Zirbes.

2 Stenquist, J. L.: Measurements of Mechanical Ability. New York, Teachers
College, Columbia University, Contributions to Education No. 130, 1923, pp. ix +
101.

2. Of What Use Are Common People?—This question is the captious 
title¹ of a recent addition to the controversial literature on the topic
“Democracy and the IQ.” A perusal of the chapter headings will
suffice to interest any student of social or educational psychology.
Those who are acquainted with this author’s critical propensities from
other writings over his signature or nom de plume will not wish to miss
his discussion of such topics as the following: “Education and Commonalty,
Mutinous Minorities, Vanity of Manners, Philosophy and
Mediocrity, Reproducing Their Kind, Democracy in Practice.”
According to the Prophet “Ezekiel,” profound pessimism as to the
soundness of democratic institutions and ideals is not warranted, at
least not while the “mugwumps have nothing better than democracy
to offer America.” From the concluding chapter which, by the way,
is written in true apocalyptic style, mixed metaphors modernized but
otherwise unabated, we quote the following choice bits (pp. 255-246).

“Visionaries see reflected, by mirage, the perfect state and reach
out for it just as a baby stretches forth his hands for the moon.
Man, to inhabit Utopia would need to check his human element on
entrance . . . Moreover, man, despite his physical, mental and
spiritual limitations, is capable of developing democracy so that the
government of the United States as it exists today will seem as different
from what it will then be, as the treadmill is from the dynamo
. . . The appeal of target practice lies largely in the difficulty of
making a high score. . . The ideal of democracy may be regarded as
the bull’s eye of a target . . . The bull’s eye of democracy is a
government in which the interests of all the people are pooled so as to
create a community of interests and, in turn, the acceptance of this
community of interest as a touchstone by which to test all governmental
activities.”

3. Two Texts on Mental Tests.—Although these two volumes are 
on the same subject they are surprisingly dissimilar. Dickson’s²
intensely concrete discussion and his numerous practical suggestions
show that the work is an outgrowth of experimentation and varied
and extensive experience with mental tests. His solutions of a host of
specific problems which he raises are significant because they have
already been put to pragmatic tests in the large school systems from
which most of the statistical data were secured. Well selected and
readable tables, charts and figures should wither prejudice and dispose
the reader to take more kindly to the somewhat didactic prescription
and authoritative tone. The concise summaries, selected references,
quotations from case studies, analytical outline, the subheadings,
paragraph heads and also the index add to the pedagogical value of the
book as a text and should recommend it to those who need a means of
securing enlightened teacher cooperation in working out pertinent policies
in school systems. Terman’s editorial introduction recognizes
the qualifications of the author in this respect.

1 Buchholz, Heinrich E. (Ezekiel Cheever): “Of What Use are Common
People? A Study in Democracy.” Baltimore, Warwick & York, Inc., 1923.

2 Dickson, V. E.: “Mental Tests and the Classroom Teacher.” Yonkers-on-Hudson,
New York, World Book Company, 1923, pp. xiv + 231.

The Hines monograph,¹ edited by Suzzallo, is much briefer and more
discursive. It contains only general references to investigators and no
quantitative data, but offers instead numerous references to and quotations
from the writings of psychologists and publicists who have participated
in the construction and criticism of mental tests. This text is
more historical and informative. Those who have no practical interest 
or need for more precise data will perhaps get from this briefer analysis 
the necessary background for a more intelligent grasp of the debatable 
points involved in current discussions of the subject. A comparison 
of the outlines of the Dickson and Hines texts will make these differ- 
ences stand out clearly and help those who must make a choice. But 
laymen who intend to contribute to the discussion will do well to read 
Hines, study Dickson and ponder well over the references in both books 
before they take their pens in hand. 

4. Verbalism and Formalism versus Vitalized Instruction.—In this 
little three-chapter monograph² the author exhibits instances to
support the contention that more than 50 per cent of our teaching
results in a relatively meaningless repetition of words. He analyzes
the causes of over-verbalism and examines in detail the significance
of each of the following contributing factors: (1) the isolation of the
school, (2) the symbolic nature of the curriculum, (3) the passivity of
the child, (4) the limitations of the teacher. He mentions the complications
caused by overcrowded curricula and the dominance of the
doctrine of formal discipline, and then proceeds, first to a statement of
the principles which underlie the solution of the problem and then to an
alignment of practical ways and means based on sound principles.
In passing, he criticizes a variety of cure-alls and shibboleths, and calls
attention to the place of sound psychology and pedagogy in a critical
evaluation of means and methods of exhibiting subject matter and
enlisting pupil activity.

This book should prove valuable in the training of teachers, first
because it is constructive notwithstanding its critical tenor, and second
because the content is so lucid and well-organized that it should be
readily assimilated. The appended outline should also prove exceedingly
helpful to student and instructor.

1 Hines, H. D.: “Measuring Intelligence.” Boston, Houghton Mifflin Company,
1923, pp. xii + 146.

2 Ruediger, W. C.: “Vitalized Teaching.” Boston, Houghton Mifflin Company,
1923, pp. xiii + 110.

5. Mental and Educational Tests in Higher Education.—Here we 
have a book¹ offering (1) a description of the “New Plan of Admission,” adopted
by Columbia University in 1919, and a critical discussion of the Thorndike
Examination and data secured by its use, and (2) an analysis of
the factors of college success and of the principles of educational
measurement upon which new types of content examinations in
college subjects have been constructed. A large number of sample
examinations included illustrate the application of these principles,
the variety of technics used, and the nature of the data obtainable
through the use of such tests in Physics, Zoology, Government,
History, Economics, Philosophy, Architecture, Hydraulics and
English.

The college entrance board examination in plane geometry is
criticized and an array of sample questions for a more valid examination
in Geometry is submitted for consideration. In the appendices
are sample questions from the Thorndike examination, data on its
separate parts, and suggestions for its improvement. Instead of
merely writing the usual brief editorial introduction, Dr. Terman has
contributed the first chapter of the book, in which he challenges
institutions of higher learning to undertake a serious type of personnel
research and outlines the functions of a bureau charged with that task.

6. On the Psychology of Early Childhood.—This book by an English 
author² will prove helpful to parents and others charged with the
educational care of the very young child. Chapter 1 is entitled “Cradle
Education” and gives interesting comment on the nature of the child’s
earliest reactions and advice as to the educational and psychological
significance of the appropriate use of such data. Succeeding chapters
are also replete with instances drawn from the careful and continued
observation of cases. It is the author’s contention that many of the
maladjustments to reality manifest in mature persons may be due to a
rather general tendency to overstimulate and misdirect imaginative
tendencies in infancy and early childhood.

1 Wood, B. D.: “Measurement in Higher Education.” Yonkers-on-Hudson,
New York, World Book Company, 1923, pp. xi + 337.

2 Drummond, M.: “Some Contributions to Child Psychology.” New York,
Longmans, Green and Company, 1923, pp. viii + 151.








